Accurate object geometry estimation is essential for many downstream tasks, including robotic manipulation and physical interaction. Although vision is the dominant modality for shape perception, it becomes unreliable under occlusions or challenging lighting conditions. In such scenarios, tactile sensing provides direct geometric information through physical contact.
However, reconstructing global 3D geometry from sparse local touches alone is fundamentally underconstrained. We present TouchAnything, a framework that leverages a pretrained large-scale 2D vision diffusion model as a semantic and geometric prior for 3D reconstruction from sparse tactile measurements.
Unlike prior work that trains category-specific reconstruction networks or learns diffusion models directly from tactile data, we transfer the geometric knowledge encoded in pretrained visual diffusion models to the tactile domain. Given sparse contact constraints and a coarse class-level description of the object, we formulate reconstruction as an optimization problem that enforces tactile consistency while guiding solutions toward shapes consistent with the diffusion prior. Our method reconstructs accurate geometries from only a few touches, outperforms existing baselines, and enables open-world 3D reconstruction of previously unseen object instances.
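To make the formulation concrete, the minimal sketch below (our illustration, not the paper's released code) fits a small SDF network under the two terms just described: a tactile consistency loss that pins measured contact points and normals to the zero level set, and a diffusion-prior term. Here diffusion_guidance_loss is a hypothetical stub; in a full system it would render the current shape from random viewpoints and apply score-distillation-style guidance from Stable Diffusion. Network sizes, loss weights, and the random contact data are all illustrative assumptions.

    # Sketch only: illustrative assumptions, not the authors' implementation.
    import torch
    import torch.nn as nn

    class SDFNet(nn.Module):
        """Small MLP mapping 3D points to signed distance values."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
        def forward(self, x):
            return self.net(x)

    def tactile_consistency_loss(sdf, contact_pts, contact_normals):
        """Contact points must lie on the zero level set, with SDF gradients
        aligned to the measured surface normals."""
        contact_pts = contact_pts.requires_grad_(True)
        d = sdf(contact_pts)
        grad = torch.autograd.grad(d.sum(), contact_pts, create_graph=True)[0]
        on_surface = d.abs().mean()
        normal_align = (1 - torch.cosine_similarity(grad, contact_normals, dim=-1)).mean()
        return on_surface + 0.1 * normal_align  # 0.1 is an illustrative weight

    def diffusion_guidance_loss(sdf, prompt):
        """Hypothetical stub for the diffusion prior term (score distillation
        on rendered views). Returning zero keeps the sketch runnable."""
        return torch.zeros((), requires_grad=True)

    sdf = SDFNet()
    opt = torch.optim.Adam(sdf.parameters(), lr=1e-4)
    contacts = torch.randn(32, 3)  # stand-in for tactile contact points
    normals = torch.nn.functional.normalize(torch.randn(32, 3), dim=-1)

    for step in range(1000):
        loss = tactile_consistency_loss(sdf, contacts, normals) \
             + diffusion_guidance_loss(sdf, "an avocado")
        opt.zero_grad()
        loss.backward()
        opt.step()

The normal-alignment term uses the SDF gradient, which coincides with the surface normal on the zero level set, so a single measurement constrains both position and local orientation.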
Our pipeline derives local geometry from GelSight tactile sensors and integrates it with a general-purpose 2D Stable Diffusion prior. Reconstruction proceeds in a coarse-to-fine manner.
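As an illustration of the first step, the sketch below back-projects a single GelSight height map into a world-frame point cloud, assuming the height map has already been recovered from the tactile image (e.g., via photometric stereo) and the sensor pose is known from robot kinematics. The function name, pixel pitch, and frame conventions are our assumptions, not values from the paper.

    # Hedged sketch: turn one GelSight measurement into world-frame points.
    import numpy as np

    def gelsight_to_world_points(height_map, pixel_pitch_mm, T_world_sensor):
        """Back-project an (H, W) height map (mm) into world coordinates.

        height_map:     per-pixel indentation depth measured by the gel
        pixel_pitch_mm: metric size of one pixel on the sensor surface
        T_world_sensor: 4x4 sensor pose from robot forward kinematics
        """
        H, W = height_map.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        # Points in the sensor frame: x/y on the gel plane, z from the height map.
        pts_sensor = np.stack([
            (u - W / 2) * pixel_pitch_mm,
            (v - H / 2) * pixel_pitch_mm,
            height_map,
            np.ones_like(height_map),
        ], axis=-1).reshape(-1, 4)
        return (T_world_sensor @ pts_sensor.T).T[:, :3]

Accumulating these point clouds over several touches yields the sparse contact constraints that the optimization above consumes.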
Interact with our reconstructed objects below. Drag to rotate, scroll to zoom.
[Interactive 3D gallery: each example shows the real object next to our reconstruction; the first two examples also include a ground-truth scan.]
Comparison between Stage 1 (Coarse SDF) and Stage 2 (DMTet Refinement). The explicit representation allows the diffusion model to recover high-frequency surface details like the ridges on a lens hood or the textured skin of an avocado.
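For readers curious how the two stages can connect, here is a minimal sketch under our own assumptions: the frozen Stage-1 SDF is sampled at the vertices of a tetrahedral grid, and those per-vertex values, together with learnable position offsets, become the parameters that a DMTet-style Stage 2 refines before extracting an explicit mesh with marching tetrahedra. The grid itself is assumed given; real implementations typically ship precomputed tetrahedral grids.

    # Sketch of the Stage 1 -> Stage 2 handoff (our assumptions, not the paper's code).
    import torch

    def init_dmtet_from_sdf(sdf_net, tet_vertices):
        """Query the frozen coarse SDF at tet-grid vertices; the returned
        values and per-vertex offsets are what Stage 2 then optimizes."""
        with torch.no_grad():
            sdf_vals = sdf_net(tet_vertices).squeeze(-1)
        sdf_param = torch.nn.Parameter(sdf_vals.clone())
        offset_param = torch.nn.Parameter(torch.zeros_like(tet_vertices))
        return sdf_param, offset_param

Because the explicit mesh is differentiable with respect to these parameters, the diffusion prior can now sharpen high-frequency detail that the coarse implicit stage smooths over.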
Class-level prompts are often sufficient, while incorrect prompts can cause the model to hallucinate semantically wrong shapes. More touches generally improve reconstruction quality.
@article{gu2026touch,
  title={TouchAnything: Diffusion-Guided 3D Reconstruction from Sparse Robot Touches},
  author={Gu, Langzhe and Huang, Hung-Jui and Qadri, Mohamad and Kaess, Michael and Yuan, Wenzhen},
  journal={arXiv preprint},
  year={2026}
}