TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches

1Tsinghua University, 2Carnegie Mellon University, 3University of Illinois Urbana-Champaign
*Equal contribution
TouchAnything Teaser

TouchAnything reconstructs highly detailed 3D object geometries from sparse physical touches by transferring visual generative priors from large-scale 2D diffusion models to the tactile domain.

Abstract

Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact.

However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements.

Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances.

Video

Methodology

Our pipeline derives local geometry from GelSight tactile sensors and integrates it with a general-purpose 2D Stable Diffusion prior. Reconstruction proceeds in a coarse-to-fine manner.

TouchAnything Pipeline
  • Stage 1 (Coarse Geometry): We learn an SDF using a multi-resolution hash grid and an MLP, supervised by tactile-derived depth/normals and Score Distillation Sampling (SDS) from a diffusion prior.
  • Stage 2 (Fine Geometry): We transfer the geometry to an explicit DMTet (tetrahedral grid) representation, enabling high-resolution (512×512) differentiable rendering. This allows the diffusion model to sculpt high-frequency geometric details and textures.
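The tactile supervision in Stage 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: a toy analytic sphere SDF (with hypothetical parameters `center` and `radius`) stands in for the hash-grid + MLP network, and the loss simply asks contact points to lie on the zero level set while SDF gradients align with tactile-derived surface normals.

```python
import numpy as np

def sdf(p, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - center, axis=-1) - radius

def sdf_grad(p, center, radius, eps=1e-4):
    """Numerical gradient of the SDF at query points (the surface-normal direction)."""
    g = np.zeros_like(p)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        g[..., i] = (sdf(p + d, center, radius) - sdf(p - d, center, radius)) / (2 * eps)
    return g

def tactile_loss(contacts, normals, center, radius):
    """Contact points should sit on the zero level set; SDF gradients should
    align with the tactile-derived surface normals."""
    surface = np.mean(sdf(contacts, center, radius) ** 2)
    g = sdf_grad(contacts, center, radius)
    g = g / np.linalg.norm(g, axis=-1, keepdims=True)
    normal_align = np.mean(1.0 - np.sum(g * normals, axis=-1))
    return surface + normal_align

# Simulate 20 touches on a unit sphere at the origin:
rng = np.random.default_rng(0)
dirs = rng.normal(size=(20, 3))
dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
contacts = dirs   # points on the surface
normals = dirs    # outward normals of a sphere equal the radial direction

print(tactile_loss(contacts, normals, np.zeros(3), 1.0))  # near 0: correct shape
print(tactile_loss(contacts, normals, np.zeros(3), 0.8))  # larger: wrong radius
```

In the full method this loss is minimized jointly with an SDS term, which steers the underconstrained regions (where no touch data exists) toward shapes the diffusion prior considers plausible given the class-level prompt.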

Open-World Reconstruction from 20 Touches

Interact with our reconstructed objects below. Drag to rotate, scroll to zoom.

Potted Meat Can (prompt: "a can")

Real Object

Real Potted Meat Can

Ground Truth

Reconstruction

Camera Model (prompt: "a camera")

Real Object

Real Camera Model

Ground Truth

Reconstruction

Power Drill (prompt: "a drill")

Real Object

Real Drill

Reconstruction

Avocado (prompt: "an avocado")

Real Object

Real Avocado

Reconstruction

Refining Fine Geometries

Comparison between Stage 1 (Coarse SDF) and Stage 2 (DMTet Refinement). The explicit representation allows the diffusion model to recover high-frequency surface details like the ridges on a lens hood or the textured skin of an avocado.

Stage 1 vs Stage 2

Ablation Studies

Class-level prompts are often sufficient, while incorrect prompts can lead the model to hallucinate semantically wrong shapes. More touches generally improve reconstruction quality.

Ablation

BibTeX

@article{gu2026touch,
  title={TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches},
  author={Gu, Langzhe and Huang, Hung-Jui and Qadri, Mohamad and Kaess, Michael and Yuan, Wenzhen},
  journal={arXiv preprint},
  year={2026}
}