*Equal Contribution †Corresponding Authors
Minghuan did this work when he was at ByteDance Seed.
We release camera depth models for various widely used depth cameras.
Try plugging them into your existing robot pipeline to improve depth perception accuracy!
Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning, but it suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin for daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without noise augmentation or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated and slender objects, with little to no performance degradation. Further analysis reveals that sim-to-real success rates are strongly correlated with the quality of depth perception. We hope our findings will inspire future research on utilizing simulation data and 3D information in general robot policies.
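As a rough illustration of the plugin idea, the sketch below shows how a CDM-style denoiser could slot into an existing perception loop: the raw camera depth is replaced by the model output before back-projection into a point cloud. The `cdm` callable is a hypothetical stand-in for the released inference interface, and the intrinsics are placeholder values.

```python
# Minimal sketch (not the released API): denoise raw depth with a CDM-style
# callable, then back-project the result into a camera-frame point cloud.
import numpy as np

def depth_to_pointcloud(depth_m, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into an (N, 3) camera-frame cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    pts = np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

def observe(rgb, raw_depth_m, cdm, intrinsics):
    """One perception step: denoise the raw depth, then build the point cloud."""
    clean_depth = cdm(rgb, raw_depth_m)  # (H, W) float32 depth in metres
    return depth_to_pointcloud(clean_depth, *intrinsics)

if __name__ == "__main__":
    identity_cdm = lambda rgb, depth: depth  # stand-in so the sketch runs end to end
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)
    raw = np.full((480, 640), 0.8, dtype=np.float32)
    cloud = observe(rgb, raw, identity_cdm, (615.0, 615.0, 320.0, 240.0))
    print(cloud.shape)  # (307200, 3)
```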
We introduce ByteCameraDepth, a real-world multi-camera depth dataset comprising over 170,000 RGB-depth pairs from ten distinct configurations captured by seven depth cameras.
RGB-Depth Pairs
Depth Cameras
Configurations
RGB Image
Depth Image
Each Scene is Visualized in a Different Fixed Min-Max Range
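A minimal loader sketch for such RGB-depth pairs is shown below. The folder layout and the 16-bit PNG depth encoding in millimetres are assumptions based on common depth-dataset conventions, not the confirmed ByteCameraDepth release format.

```python
# Hypothetical loader: assumes RGB PNGs plus 16-bit depth PNGs in millimetres.
from pathlib import Path
import numpy as np
from PIL import Image

def load_pair(rgb_path: Path, depth_path: Path):
    """Return an (H, W, 3) uint8 RGB image and an (H, W) float32 depth map in metres."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"))
    depth_mm = np.asarray(Image.open(depth_path)).astype(np.uint16)
    depth_m = depth_mm.astype(np.float32) / 1000.0   # assumed millimetre encoding
    depth_m[depth_mm == 0] = np.nan                  # zero conventionally marks missing depth
    return rgb, depth_m

# Hypothetical layout: one folder per camera configuration.
# for rgb_path in sorted(Path("bytecameradepth/d435/scene_0/rgb").glob("*.png")):
#     depth_path = rgb_path.parent.parent / "depth" / rgb_path.name
#     rgb, depth = load_pair(rgb_path, depth_path)
```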
A pilot study on real-world imitation learning for manipulation using only geometric information.
Simulation training and real-world deployment with D435 and L515 cameras (zero-shot transfer).
Simulation environment with camera pose randomization for demonstration generation
Randomized positions for the bowl and the microwave (an articulated object with a transparent glass door)
Randomized positions for the fork (reflective and slender), the plate (thin), and the box
Real-world UR5 robot setup with third-view depth camera for manipulation tasks
Test distribution following simulation randomization boundaries for real-world experiments
Leveraging CDMs to bridge the geometric gap between simulation and the real world.
** Note that our goal is NOT to determine whether depth is a better visual modality than color, BUT to validate whether the accurate geometric information contained in a more precise depth image can benefit manipulation. Therefore, the policies in our experiments are depth-only, to exclude the effect of color information.
| Depth Model | Toothpaste-and-Cup: Pick Toothpaste | Toothpaste-and-Cup: Put Toothpaste into Cup | Stack-Bowls: Pick Bowl | Stack-Bowls: Stack Bowls |
|---|---|---|---|---|
| None | 0/15 | 0/15 | 6/15 | 3/15 |
| CDM-D435 | 10/15 | 6/15 | 11/15 | 9/15 |
50 demonstrations for each task, collected by teleoperation
Policies trained without a CDM fail to generalize to unseen object sizes (0% success rate), while CDM-enhanced policies show better generalization.
Toothpaste-and-Cup Task
Stack-Bowls Task
Zero-shot sim-to-real results using CDMs as a plugin in the real-world robot pipeline.
* The policy is robust to external interruptions during the test.
Kitchen Task - Real Robot
Canteen Task - Real Robot
Camera RGB/Depth (D435, 30 FPS)
CDM-D435 Depth (~6 FPS)
Camera RGB/Depth (L515, 30 FPS)
CDM-L515 Depth (~6 FPS)
| Camera | Depth Model | Kitchen: Pick Bowl | Kitchen: Put Bowl into Microwave | Kitchen: Close Microwave | Kitchen: Total | Canteen: Pick Fork | Canteen: Place Fork | Canteen: Pick Plate | Canteen: Dump Plate | Canteen: Place Plate | Canteen: Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sim (D435-View) | None | 43/50 | 33/50 | 32/50 | 30/50 | 40/50 | 28/50 | 47/50 | 45/50 | 33/50 | 21/50 |
| D435 | None | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| D435 | PromptDA | 11/30 | 5/30 | 0/30 | 0/30 | 17/30 | 16/30 | 7/30 | 2/30 | 6/30 | 1/30 |
| D435 | PriorDA | 16/30 | 8/30 | 7/30 | 7/30 | 30/30 | 30/30 | 1/30 | 0/30 | 0/30 | 0/30 |
| D435 | CDM-D435 | 29/30 | 26/30 | 26/30 | 26/30 | 30/30 | 30/30 | 15/30 | 14/30 | 14/30 | 14/30 |
| D435 | CDM-L515 | 29/30 | 22/30 | 16/30 | 14/30 | 30/30 | 29/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| Sim (L515-View) | None | 43/50 | 34/50 | 37/50 | 32/50 | 40/50 | 26/50 | 46/50 | 43/50 | 31/50 | 20/50 |
| L515 | None | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| L515 | PromptDA | 3/30 | 0/30 | 0/30 | 0/30 | 3/30 | 0/30 | 3/30 | 0/30 | 0/30 | 0/30 |
| L515 | PriorDA | 17/30 | 3/30 | 2/30 | 2/30 | 10/30 | 8/30 | 3/30 | 3/30 | 3/30 | 3/30 |
| L515 | CDM-D435 | 22/30 | 11/30 | 9/30 | 9/30 | 13/30 | 11/30 | 11/30 | 10/30 | 9/30 | 9/30 |
| L515 | CDM-L515 | 25/30 | 18/30 | 18/30 | 18/30 | 24/30 | 24/30 | 22/30 | 22/30 | 22/30 | 22/30 |
* Real-world videos are slightly sped up for alignment (~1.4×)
Kitchen Task - D435 Camera
Kitchen Task - L515 Camera
Canteen Task - D435 Camera
Canteen Task - L515 Camera
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
RGB Image (L515)
Raw Depth (L515)
CDM-L515 Output
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
RGB Image (L515)
Raw Depth (L515)
CDM-L515 Output
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
Interactive 3D visualization - Use mouse to rotate, scroll to zoom
* Point clouds are downsampled from 640×480 to 90,000 points.
Raw Depth Point Cloud
Model A: CDM-D435
Model B: CDM-L515
* denotes models fine-tuned on the same synthesized data as the CDMs.
| Split | Depth Model | L1 ↓ | RMSE ↓ | AbsRel ↓ | δ₀.₅ ↑ | δ₁ ↑ |
|---|---|---|---|---|---|---|
| D435 (IR Stereo) | CDM-D435 (Ours) | 0.0258 | 0.0404 | 0.0312 | 0.9842 | 0.9951 |
| D435 (IR Stereo) | CDM-L515 (Ours) | 0.0182 | 0.0338 | 0.0217 | 0.9877 | 0.9956 |
| D435 (IR Stereo) | PromptDA* (435) | 0.0434 | 0.0666 | 0.0599 | 0.9459 | 0.9770 |
| D435 (IR Stereo) | PromptDA* (515) | 0.1830 | 0.2387 | 0.2750 | 0.8802 | 0.9186 |
| D435 (IR Stereo) | PromptDA | 0.0396 | 0.0691 | 0.0484 | 0.9503 | 0.9772 |
| D435 (IR Stereo) | PriorDA | 0.0388 | 0.0754 | 0.0461 | 0.9632 | 0.9880 |
| D435 (IR Stereo) | Raw Depth | 0.0550 | 0.1458 | 0.0708 | 0.9179 | 0.9543 |
| L515 (D-ToF) | CDM-L515 (Ours) | 0.0156 | 0.0297 | 0.0229 | 0.9754 | 0.9919 |
| L515 (D-ToF) | CDM-D435 (Ours) | 0.0165 | 0.0349 | 0.0246 | 0.9613 | 0.9855 |
| L515 (D-ToF) | PromptDA* (515) | 0.0235 | 0.0666 | 0.0349 | 0.9291 | 0.9730 |
| L515 (D-ToF) | PromptDA* (435) | 0.0254 | 0.0438 | 0.0379 | 0.9234 | 0.9640 |
| L515 (D-ToF) | PromptDA | 0.0207 | 0.0515 | 0.0304 | 0.9480 | 0.9699 |
| L515 (D-ToF) | PriorDA | 0.0177 | 0.0385 | 0.0274 | 0.9502 | 0.9763 |
| L515 (D-ToF) | Raw Depth | 0.0312 | 0.0813 | 0.0475 | 0.9098 | 0.9429 |
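For reference, the metrics in the table above can be computed as in the sketch below. The δ thresholds assume the common convention δₙ = fraction of valid pixels with max(pred/gt, gt/pred) < 1.25ⁿ; the exact thresholds used in the paper may differ.

```python
# Standard depth metrics over valid pixels; delta thresholds assume 1.25**n.
import numpy as np

def depth_metrics(pred, gt, valid=None):
    if valid is None:
        valid = np.isfinite(gt) & (gt > 0)
    p, g = pred[valid], gt[valid]
    err = p - g
    ratio = np.maximum(p / g, g / p)
    return {
        "L1": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        "AbsRel": float((np.abs(err) / g).mean()),
        "delta_0.5": float((ratio < 1.25 ** 0.5).mean()),
        "delta_1": float((ratio < 1.25).mean()),
    }
```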
To characterize the working range of CDMs and help users apply them effectively, we evaluated their depth accuracy at various distances on the HAMMER dataset. The results show that CDMs achieve high accuracy across distances, with performance trends following the original cameras' capabilities while significantly reducing noise and errors.
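A working-range analysis of this kind can be reproduced by binning the per-pixel error by ground-truth distance, as in the sketch below; the bin edges are illustrative choices, not the evaluation protocol used in the paper.

```python
# Illustrative distance-binned AbsRel; bin edges are arbitrary example values.
import numpy as np

def absrel_by_distance(pred, gt, edges=(0.25, 0.5, 1.0, 2.0, 4.0)):
    valid = np.isfinite(gt) & (gt > 0)
    p, g = pred[valid], gt[valid]
    absrel = np.abs(p - g) / g
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (g >= lo) & (g < hi)
        if in_bin.any():
            results[f"{lo:.2f}-{hi:.2f} m"] = float(absrel[in_bin].mean())
    return results
```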
Interactive 3D visualization of CDM processed point clouds from ByteCameraDepth dataset
* Point clouds are downsampled from 640×480 to 90,000 points.
Raw Depth Point Cloud
CDM-D405 Point Cloud
RGB Image
Camera Depth / CDM Depth
We model depth camera noise patterns to generate high-quality paired data from simulation for training CDMs.
CDMs process RGB images and noisy depth signals from specific depth cameras to produce high-quality, denoised metric depth.
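The data engine itself is neural, as described above; the sketch below is only a simplistic hand-crafted stand-in that illustrates the paired-data idea, corrupting clean simulated depth with stereo-style quantization, jitter, and edge dropout so that (noisy depth, clean depth) pairs can supervise a CDM. The baseline, focal length, and noise magnitudes are placeholder values.

```python
# Illustrative only: hand-crafted corruption of clean simulated depth to form
# (noisy, clean) training pairs. The actual data engine is a learned noise model.
import numpy as np

def corrupt_sim_depth(clean_m, rng, baseline_m=0.05, fx=615.0, subpixel_sigma=0.3):
    """Apply stereo-style disparity jitter and edge dropout to a clean depth map (metres)."""
    disparity = baseline_m * fx / np.clip(clean_m, 1e-3, None)   # metres -> pixels
    disparity += rng.normal(0.0, subpixel_sigma, clean_m.shape)  # sub-pixel matching noise
    noisy = baseline_m * fx / np.clip(disparity, 1e-3, None)
    gy, gx = np.gradient(clean_m)
    edges = np.hypot(gx, gy) > 0.02                              # depth discontinuities
    drop = edges & (rng.random(clean_m.shape) < 0.5)             # stereo often fails at edges
    noisy[drop] = 0.0                                            # zero marks missing depth
    return noisy.astype(np.float32)

# Training pair: input = (rendered RGB, corrupt_sim_depth(clean)); target = clean.
```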
@article{liu2025manipulation,
title={Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots},
author={Liu, Minghuan and Zhu, Zhengbang and Han, Xiaoshen and Hu, Peng and Lin, Haotong and
Li, Xinyao and Chen, Jingxiao and Xu, Jiafeng and Yang, Yichu and Lin, Yunfeng and
Li, Xinghang and Yu, Yong and Zhang, Weinan and Kong, Tao and Kang, Bingyi},
journal={arXiv preprint},
year={2025}
}