*Equal Contribution †Corresponding Authors
Minghuan did this work when he was at ByteDance Seed.
We release camera depth models for various widely used depth cameras.
Try plugging them into your existing robot pipeline to improve depth perception accuracy!
Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning, but it suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin for daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without noise augmentation or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated and slender objects, with little to no performance degradation. Further analysis reveals that sim-to-real success rates are strongly correlated with the quality of depth perception. We hope our findings will inspire future research on utilizing simulation data and 3D information in general robot policies.
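As a rough illustration of the plugin idea, the sketch below shows how a CDM-style denoiser could slot into an existing perception loop: the raw camera depth is replaced by the model output before back-projection into a point cloud. The `cdm` callable is a hypothetical stand-in for the released inference interface, and the intrinsics are placeholder values.

```python
# Minimal sketch (not the released API): denoise raw depth with a CDM-style
# callable, then back-project the result into a camera-frame point cloud.
import numpy as np

def depth_to_pointcloud(depth_m, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map into an (N, 3) camera-frame cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    pts = np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop invalid (zero-depth) pixels

def observe(rgb, raw_depth_m, cdm, intrinsics):
    """One perception step: denoise the raw depth, then build the point cloud."""
    clean_depth = cdm(rgb, raw_depth_m)  # (H, W) float32 depth in metres
    return depth_to_pointcloud(clean_depth, *intrinsics)

if __name__ == "__main__":
    identity_cdm = lambda rgb, depth: depth  # stand-in so the sketch runs end to end
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)
    raw = np.full((480, 640), 0.8, dtype=np.float32)
    cloud = observe(rgb, raw, identity_cdm, (615.0, 615.0, 320.0, 240.0))
    print(cloud.shape)  # (307200, 3)
```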
We introduce ByteCameraDepth, a real-world multi-camera depth dataset comprising over 170,000 RGB-depth pairs from ten distinct configurations captured by seven depth cameras.
RGB-Depth Pairs
Depth Cameras
Configurations
RGB Image
Depth Image
Each Scene is Visualized in a Different Fixed Min-Max Range
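A minimal loader sketch for such RGB-depth pairs is shown below. The folder layout and the 16-bit PNG depth encoding in millimetres are assumptions based on common depth-dataset conventions, not the confirmed ByteCameraDepth release format.

```python
# Hypothetical loader: assumes RGB PNGs plus 16-bit depth PNGs in millimetres.
from pathlib import Path
import numpy as np
from PIL import Image

def load_pair(rgb_path: Path, depth_path: Path):
    """Return an (H, W, 3) uint8 RGB image and an (H, W) float32 depth map in metres."""
    rgb = np.asarray(Image.open(rgb_path).convert("RGB"))
    depth_mm = np.asarray(Image.open(depth_path)).astype(np.uint16)
    depth_m = depth_mm.astype(np.float32) / 1000.0   # assumed millimetre encoding
    depth_m[depth_mm == 0] = np.nan                  # zero conventionally marks missing depth
    return rgb, depth_m

# Hypothetical layout: one folder per camera configuration.
# for rgb_path in sorted(Path("bytecameradepth/d435/scene_0/rgb").glob("*.png")):
#     depth_path = rgb_path.parent.parent / "depth" / rgb_path.name
#     rgb, depth = load_pair(rgb_path, depth_path)
```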
A pilot study on real-world imitation learning for manipulation using only geometric information.
Simulation training and real-world deployment with D435 and L515 cameras (zero-shot transfer).
Simulation environment with camera pose randomization for demonstration generation
Randomized positions for the bowl and the microwave (an articulated object with a transparent glass door)
Randomized positions for the fork (reflective and slender), the plate (thin), and the box
Real-world UR5 robot setup with third-view depth camera for manipulation tasks
Test distribution following simulation randomization boundaries for real-world experiments
Leveraging CDMs to bridge the geometric gap between simulation and the real world.
** Note that our goal is NOT to determine whether depth is a better visual modality than color, BUT to validate whether the accurate geometric information contained in a more precise depth image can benefit manipulation. Therefore, the policies in our experiments are depth-only, to exclude the effect of color information.
| Depth Model | Toothpaste-and-Cup: Pick Toothpaste | Toothpaste-and-Cup: Put Toothpaste into Cup | Stack-Bowls: Pick Bowl | Stack-Bowls: Stack Bowls |
|---|---|---|---|---|
| None | 0/15 | 0/15 | 6/15 | 3/15 |
| CDM-D435 | 10/15 | 6/15 | 11/15 | 9/15 |
50 demonstrations for each task, collected by teleoperation
Policies trained without a CDM fail to generalize to unseen object sizes (0% success rate), while CDM-enhanced policies show better generalization.
Toothpaste-and-Cup Task
Stack-Bowls Task
Zero-shot sim-to-real results using CDMs as a plugin in the real-world robot pipeline.
* The policy is robust to external interruptions during the test.
Kitchen Task - Real Robot
Canteen Task - Real Robot
Camera RGB/Depth (D435, 30 FPS)
CDM-D435 Depth (~6 FPS)
Camera RGB/Depth (L515, 30 FPS)
CDM-L515 Depth (~6 FPS)
| Camera | Depth Model | Kitchen: Pick Bowl | Kitchen: Put Bowl into Microwave | Kitchen: Close Microwave | Kitchen: Total | Canteen: Pick Fork | Canteen: Place Fork | Canteen: Pick Plate | Canteen: Dump Plate | Canteen: Place Plate | Canteen: Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sim (D435-View) | None | 43/50 | 33/50 | 32/50 | 30/50 | 40/50 | 28/50 | 47/50 | 45/50 | 33/50 | 21/50 |
| D435 | None | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| D435 | PromptDA | 11/30 | 5/30 | 0/30 | 0/30 | 17/30 | 16/30 | 7/30 | 2/30 | 6/30 | 1/30 |
| D435 | PriorDA | 16/30 | 8/30 | 7/30 | 7/30 | 30/30 | 30/30 | 1/30 | 0/30 | 0/30 | 0/30 |
| D435 | CDM-D435 | 29/30 | 26/30 | 26/30 | 26/30 | 30/30 | 30/30 | 15/30 | 14/30 | 14/30 | 14/30 |
| D435 | CDM-L515 | 29/30 | 22/30 | 16/30 | 14/30 | 30/30 | 29/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| Sim (L515-View) | None | 43/50 | 34/50 | 37/50 | 32/50 | 40/50 | 26/50 | 46/50 | 43/50 | 31/50 | 20/50 |
| L515 | None | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| L515 | PromptDA | 3/30 | 0/30 | 0/30 | 0/30 | 3/30 | 0/30 | 3/30 | 0/30 | 0/30 | 0/30 |
| L515 | PriorDA | 17/30 | 3/30 | 2/30 | 2/30 | 10/30 | 8/30 | 3/30 | 3/30 | 3/30 | 3/30 |
| L515 | CDM-D435 | 22/30 | 11/30 | 9/30 | 9/30 | 13/30 | 11/30 | 11/30 | 10/30 | 9/30 | 9/30 |
| L515 | CDM-L515 | 25/30 | 18/30 | 18/30 | 18/30 | 24/30 | 24/30 | 22/30 | 22/30 | 22/30 | 22/30 |
* Real-world videos are slightly sped up for alignment (~1.4×)
Kitchen Task - D435 Camera
Kitchen Task - L515 Camera
Canteen Task - D435 Camera
Canteen Task - L515 Camera
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
RGB Image (L515)
Raw Depth (L515)
CDM-L515 Output
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
RGB Image (L515)
Raw Depth (L515)
CDM-L515 Output
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
RGB Image (D435)
Raw Depth (D435)
CDM-D435 Output
Interactive 3D visualization - Use mouse to rotate, scroll to zoom
* Point clouds are downsampled from 640×480 to 90,000 points.
Raw Depth Point Cloud
Model A: CDM-D435
Model B: CDM-L515
* denotes models fine-tuned on the same synthesized data as the CDMs.
| Split | Depth Model | L1 ↓ | RMSE ↓ | AbsRel ↓ | δ₀.₅ ↑ | δ₁ ↑ |
|---|---|---|---|---|---|---|
| D435 (IR Stereo) | CDM-D435 (Ours) | 0.0258 | 0.0404 | 0.0312 | 0.9842 | 0.9951 |
| D435 (IR Stereo) | CDM-L515 (Ours) | 0.0182 | 0.0338 | 0.0217 | 0.9877 | 0.9956 |
| D435 (IR Stereo) | PromptDA* (435) | 0.0434 | 0.0666 | 0.0599 | 0.9459 | 0.9770 |
| D435 (IR Stereo) | PromptDA* (515) | 0.1830 | 0.2387 | 0.2750 | 0.8802 | 0.9186 |
| D435 (IR Stereo) | PromptDA | 0.0396 | 0.0691 | 0.0484 | 0.9503 | 0.9772 |
| D435 (IR Stereo) | PriorDA | 0.0388 | 0.0754 | 0.0461 | 0.9632 | 0.9880 |
| D435 (IR Stereo) | Raw Depth | 0.0550 | 0.1458 | 0.0708 | 0.9179 | 0.9543 |
| L515 (D-ToF) | CDM-L515 (Ours) | 0.0156 | 0.0297 | 0.0229 | 0.9754 | 0.9919 |
| L515 (D-ToF) | CDM-D435 (Ours) | 0.0165 | 0.0349 | 0.0246 | 0.9613 | 0.9855 |
| L515 (D-ToF) | PromptDA* (515) | 0.0235 | 0.0666 | 0.0349 | 0.9291 | 0.9730 |
| L515 (D-ToF) | PromptDA* (435) | 0.0254 | 0.0438 | 0.0379 | 0.9234 | 0.9640 |
| L515 (D-ToF) | PromptDA | 0.0207 | 0.0515 | 0.0304 | 0.9480 | 0.9699 |
| L515 (D-ToF) | PriorDA | 0.0177 | 0.0385 | 0.0274 | 0.9502 | 0.9763 |
| L515 (D-ToF) | Raw Depth | 0.0312 | 0.0813 | 0.0475 | 0.9098 | 0.9429 |
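For reference, the metrics in the table above can be computed as in the sketch below. The δ thresholds assume the common convention δₙ = fraction of valid pixels with max(pred/gt, gt/pred) < 1.25ⁿ; the exact thresholds used in the paper may differ.

```python
# Standard depth metrics over valid pixels; delta thresholds assume 1.25**n.
import numpy as np

def depth_metrics(pred, gt, valid=None):
    if valid is None:
        valid = np.isfinite(gt) & (gt > 0)
    p, g = pred[valid], gt[valid]
    err = p - g
    ratio = np.maximum(p / g, g / p)
    return {
        "L1": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        "AbsRel": float((np.abs(err) / g).mean()),
        "delta_0.5": float((ratio < 1.25 ** 0.5).mean()),
        "delta_1": float((ratio < 1.25).mean()),
    }
```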
To characterize the working range of CDMs and help users apply them effectively, we evaluated their depth accuracy at various distances on the HAMMER dataset. The results show that CDMs achieve high accuracy across distances, with performance trends following the original cameras' capabilities while significantly reducing noise and errors.
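A working-range analysis of this kind can be reproduced by binning the per-pixel error by ground-truth distance, as in the sketch below; the bin edges are illustrative choices, not the evaluation protocol used in the paper.

```python
# Illustrative distance-binned AbsRel; bin edges are arbitrary example values.
import numpy as np

def absrel_by_distance(pred, gt, edges=(0.25, 0.5, 1.0, 2.0, 4.0)):
    valid = np.isfinite(gt) & (gt > 0)
    p, g = pred[valid], gt[valid]
    absrel = np.abs(p - g) / g
    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (g >= lo) & (g < hi)
        if in_bin.any():
            results[f"{lo:.2f}-{hi:.2f} m"] = float(absrel[in_bin].mean())
    return results
```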
Interactive 3D visualization of CDM processed point clouds from ByteCameraDepth dataset
* Point clouds are downsampled from 640×480 to 90,000 points.
Raw Depth Point Cloud
CDM-D405 Point Cloud
RGB Image
Camera Depth / CDM Depth
We model depth camera noise patterns to generate high-quality paired data from simulation for training CDMs.
CDMs process RGB images and noisy depth signals from specific depth cameras to produce high-quality, denoised metric depth.
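The data engine itself is neural, as described above; the sketch below is only a simplistic hand-crafted stand-in that illustrates the paired-data idea, corrupting clean simulated depth with stereo-style quantization, jitter, and edge dropout so that (noisy depth, clean depth) pairs can supervise a CDM. The baseline, focal length, and noise magnitudes are placeholder values.

```python
# Illustrative only: hand-crafted corruption of clean simulated depth to form
# (noisy, clean) training pairs. The actual data engine is a learned noise model.
import numpy as np

def corrupt_sim_depth(clean_m, rng, baseline_m=0.05, fx=615.0, subpixel_sigma=0.3):
    """Apply stereo-style disparity jitter and edge dropout to a clean depth map (metres)."""
    disparity = baseline_m * fx / np.clip(clean_m, 1e-3, None)   # metres -> pixels
    disparity += rng.normal(0.0, subpixel_sigma, clean_m.shape)  # sub-pixel matching noise
    noisy = baseline_m * fx / np.clip(disparity, 1e-3, None)
    gy, gx = np.gradient(clean_m)
    edges = np.hypot(gx, gy) > 0.02                              # depth discontinuities
    drop = edges & (rng.random(clean_m.shape) < 0.5)             # stereo often fails at edges
    noisy[drop] = 0.0                                            # zero marks missing depth
    return noisy.astype(np.float32)

# Training pair: input = (rendered RGB, corrupt_sim_depth(clean)); target = clean.
```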
@article{liu2025manipulation,
title={Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots},
author={Liu, Minghuan and Zhu, Zhengbang and Han, Xiaoshen and Hu, Peng and Lin, Haotong and
Li, Xinyao and Chen, Jingxiao and Xu, Jiafeng and Yang, Yichu and Lin, Yunfeng and
Li, Xinghang and Yu, Yong and Zhang, Weinan and Kong, Tao and Kang, Bingyi},
journal={arXiv preprint},
year={2025}
}