General In-Hand Object Rotation with Vision and Touch

Abstract

We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and highlight the importance of visual and tactile sensing.

Many Objects, Many Axes

Interactive Visualization

We show interactive visualizations of rotation about multiple axes. Use your mouse to control the viewing angle!

More on [X-axis] [Y-axis] [Z-axis] [Irregular-axis]

X-Axis, Toy Airplane

Y-Axis, Rubik's Cube

Z-Axis, Pressure Cooker

[-1,-1,-1] Axis, Stanford Bunny

[0,-1,-1] Axis, EGAD Shape

[1,0,-1] Axis, Dice

Visual Sim-to-Real

Object depth serves as the bridge between simulation and the real world.
In the real world, we use Segment-Anything to segment the object.
We ignore the background in both simulation and the real world.
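
The background-removal step can be illustrated with a minimal sketch. It assumes a depth image and a boolean object mask of the same size are already available (for example, a mask produced by a segmentation model in the real world, or taken from the simulator). The function name `mask_object_depth` and the choice of `0.0` as the background fill value are hypothetical and are not taken from the released system.

```python
import numpy as np

def mask_object_depth(depth, object_mask, background_value=0.0):
    """Keep depth only where object_mask is True; fill every other pixel
    with background_value so the background carries no information."""
    depth = np.asarray(depth, dtype=np.float32)
    object_mask = np.asarray(object_mask, dtype=bool)
    if depth.shape != object_mask.shape:
        raise ValueError("depth and object_mask must have the same shape")
    return np.where(object_mask, depth, np.float32(background_value))

# Toy 3x3 depth image (metres); the object occupies the centre column.
depth = np.array([[0.9, 0.4, 0.9],
                  [0.9, 0.5, 0.9],
                  [0.9, 0.6, 0.9]])
mask = np.zeros((3, 3), dtype=bool)
mask[:, 1] = True

object_depth = mask_object_depth(depth, mask)
```

Applying the same masking in simulation and on the real robot means the policy sees only object pixels in both domains, which is the sense in which object depth can act as a shared representation.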

Tactile Sim-to-Real

Vision-based tactile sensing is hard to simulate, so we approximate it with discrete contact locations.
In simulation, we can directly query contact points.
In the real world, we use simple color tracking to measure pixel displacement and then discretize it.
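
One way the discretization step might look is sketched below. It assumes the color tracker yields one 2D pixel displacement per tracked location; the function name `discretize_contact`, the binary contact/no-contact output, and the 2-pixel threshold are illustrative assumptions rather than the values used in the actual system.

```python
import numpy as np

def discretize_contact(displacement_px, threshold_px=2.0):
    """Map per-location pixel displacements of shape (N, 2) to binary
    contact flags of shape (N,): 1 if the displacement magnitude exceeds
    threshold_px, otherwise 0."""
    displacement_px = np.asarray(displacement_px, dtype=np.float32)
    magnitude = np.linalg.norm(displacement_px, axis=-1)
    return (magnitude > threshold_px).astype(np.int8)

# Three tracked locations with displacement magnitudes 0.5, 5.0 and ~1.41 px.
displacements = np.array([[0.0, 0.5],
                          [3.0, 4.0],
                          [1.0, 1.0]])
contacts = discretize_contact(displacements)  # only the second location is in contact
```

A coarse signal like this can also be produced in simulation by querying contact points directly, so both domains can feed the policy the same kind of input.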

Vision and Touch Improve Manipulation of Hard Objects

We plot the relative improvements across various object shapes for x-axis rotation. We find that the point cloud gives the largest improvement on objects with non-uniform w/d/h (width/depth/height) ratios and on objects with irregular shapes, such as the bunny and light bulb. The improvements on regular objects are smaller but still over 40%.

Similar to our findings in oracle policy training, we observe that the visuotactile policy shows larger improvements on irregular and non-uniform objects.

Vision and Touch Improve Out-of-Distribution (OOD) Generalization

Without point-cloud input, performance drops by 22% on out-of-distribution objects, whereas with point-cloud input the drop is only 8%. Visuotactile information is likewise critical for OOD generalization: using proprioception alone leads to a 41% performance drop, while adding vision and touch reduces the drop to 15%.

Emergent Meaningful Latent Representation

After training, we freeze the policy and then try to predict 3D shapes from the learned embedding space.
Comparing stage 1 (w/o shape) with stage 1 (w/ shape), we find that shape information is preserved even when the only learning signal is the task reward.
Comparing stage 2 (proprioception only) with stage 2 (visuotactile), we find that the learned latent space can successfully reconstruct rough 3D shapes.
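
The probing idea (freeze the embeddings, then fit a separate predictor from them to shape) can be sketched as follows. This is not the decoder used in the paper: it swaps in a plain ridge-regression linear probe, runs on synthetic data, and all names (`fit_linear_probe`, `predict_shape`) and dimensions are hypothetical.

```python
import numpy as np

def fit_linear_probe(embeddings, targets, ridge=1e-3):
    """Ridge regression from frozen embeddings (N, D) to flattened
    point sets (N, K). The embeddings themselves are never updated."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])  # bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def predict_shape(weights, embeddings, num_points):
    """Apply the fitted probe and reshape to (N, num_points, 3) point sets."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return (X @ weights).reshape(len(embeddings), num_points, 3)

# Synthetic stand-in data: shapes are a linear function of the embedding,
# so a linear probe should recover them almost exactly.
rng = np.random.default_rng(0)
num_objects, embed_dim, num_points = 64, 8, 16
embeddings = rng.normal(size=(num_objects, embed_dim))
true_map = rng.normal(size=(embed_dim, num_points * 3))
shapes = embeddings @ true_map  # (N, num_points * 3)

weights = fit_linear_probe(embeddings, shapes)
pred = predict_shape(weights, embeddings, num_points)
max_error = np.abs(pred.reshape(num_objects, -1) - shapes).max()
```

Because the probe is trained only after the policy is frozen, any shape information it recovers must already be present in the embedding. A low reconstruction error therefore suggests the latent space encodes shape, while a high error suggests it does not.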

Bibtex

	  @inproceedings{qi2023general,
	   author={Qi, Haozhi and Yi, Brent and Suresh, Sudharshan and Lambeta, Mike and Ma, Yi and Calandra, Roberto and Malik, Jitendra},
	   title={{General In-Hand Object Rotation with Vision and Touch}},
	   booktitle={Conference on Robot Learning (CoRL)},
	   year={2023}
	  }

Acknowledgement

The interactive visualization and the mesh visualizations in the paper were created with Viser.

This research was supported as a BAIR Open Research Common Project with Meta. In their academic roles at UC Berkeley, Haozhi Qi and Jitendra Malik are supported in part by DARPA Machine Common Sense (MCS), Brent Yi is supported by the NSF Graduate Research Fellowship Program under Grant DGE 2146752, and Haozhi Qi, Brent Yi, and Yi Ma are partially supported by ONR N00014-22-1-2102 and the InnoHK HKCRC grant. Roberto Calandra is funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden. We thank Shubham Goel, Eric Wallace, Angjoo Kanazawa, and Raunaq Bhirangi for their feedback. We thank Austin Wang and Tingfan Wu for their help with hardware. We thank Xinru Yang for her help with real-world videos.