Geometry-aware 4D Video Generation for Robot Manipulation

Stanford University · Toyota Research Institute
We propose a 4D video generation model that enforces geometric consistency across multiple views to generate spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate its application to robot manipulation by extracting gripper poses from the generated videos with an off-the-shelf pose-tracking algorithm.
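To make this downstream pipeline concrete, the following is a minimal Python sketch, not the paper's code: gripper_trajectory, tracker, and its track method are hypothetical names standing in for whichever off-the-shelf RGB-D pose tracker is used.

import numpy as np

def gripper_trajectory(rgb_frames, depth_frames, intrinsics, tracker):
    """Recover per-frame 6-DoF gripper poses from a generated RGB-D video."""
    poses = []
    for rgb, depth in zip(rgb_frames, depth_frames):
        # `tracker.track` is a hypothetical interface: one RGB-D frame in,
        # one 4x4 SE(3) gripper pose (camera frame) out.
        poses.append(tracker.track(rgb, depth, intrinsics))
    return np.stack(poses)  # (T, 4, 4) pose trajectory for downstream control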

Method

Method Overview

Our model takes RGB-D inputs from two camera viewpoints and predicts future pointmaps alongside RGB images. To enforce geometric consistency across views, we introduce cross-attention layers in the U-Net decoders for pointmap prediction. The resulting 4D video can then be used with off-the-shelf RGB-D pose tracking methods to recover the 6-DoF pose of the robot end-effector, enabling direct application to downstream manipulation tasks.
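As an illustration of the cross-view coupling, below is a minimal PyTorch sketch of a cross-view attention block operating on two decoder feature maps; the module name, head count, and the choice to share attention weights across both directions are our assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Each view's decoder features attend to the other view's features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (B, C, H, W) decoder features for the two views.
        B, C, H, W = feat_a.shape
        a = feat_a.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens, view A
        b = feat_b.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens, view B
        # Each view queries the other's tokens; residuals keep per-view detail.
        a2, _ = self.attn(self.norm(a), self.norm(b), self.norm(b))
        b2, _ = self.attn(self.norm(b), self.norm(a), self.norm(a))
        a, b = a + a2, b + b2
        # Restore the spatial layout expected by the next decoder stage.
        return (a.transpose(1, 2).reshape(B, C, H, W),
                b.transpose(1, 2).reshape(B, C, H, W))

A block like this would be interleaved with the decoder stages of the pointmap U-Net so that the two views' predictions stay geometrically consistent.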


(a) StoreCerealBoxUnderShelf 🥣

In this task, a single robot arm picks up a cereal box from the top of a shelf and inserts it into the shelf below. Occlusions occur during insertion, especially from certain camera viewpoints, making multi-view predictions essential. Additionally, the pick-and-insert action requires spatial understanding and precision.

Ground-truth RGB-D videos (RGB / depth)
Generated RGB-D videos on unseen views (RGB / depth)

Baselines:

OURS without cross-attention (RGB)
OURS without cross-attention (depth)
SVD
SVD with cross-attention

Policy Rollout (unseen views):

OURS succeeds
OURS succeeds
Diffusion Policy fails to insert the cereal box
Diffusion Policy fails to pick up the cereal box

(b) PutSpatulaOnTable 👩‍🍳

A single robot arm retrieves a spatula from a utensil crock and places it on the left side of the table. This task requires precise manipulation to successfully grasp the narrow object.

Ground-truth RGB-D videos (RGB / depth)
Generated RGB-D videos on unseen views (RGB / depth)

Baselines:

OURS without cross-attention (RGB)
OURS without cross-attention (depth)
SVD
SVD with cross-attention

Policy Rollout (unseen views):

OURS succeeds
Diffusion Policy fails to grasp the spatula

(c) PlaceAppleFromBowlIntoBin 🍎

One robot arm picks up an apple from a bowl on the left side of the table and places it on a shelf; a second arm then picks up the apple and deposits it into a bin on the right side. This is a long-horizon, bimanual task that tests the model's ability to predict both temporally and spatially consistent trajectories.

Ground-truth RGB-D videos (RGB / depth)
Generated RGB-D videos on unseen views (RGB / depth)

Baselines:

OURS without cross-attention (RGB)
OURS without cross-attention (depth)
SVD
SVD with cross-attention

Policy Rollout (unseen views):

OURS succeeds
Diffusion Policy fails to pick up the apple

(d) Real World Results 🌍

By finetuning the model trained in simulation, our approach transfers to real-world RGB-D video generation for several manipulation tasks.

Generated RGB-D videos (RGB / depth)