license: mit
language:
  - en

Model Card for 3D Diffuser Actor

A robot manipulation policy that marries diffusion modeling with 3D scene representations. 3D Diffuser Actor is trained and evaluated on the RLBench and CALVIN simulation benchmarks. We release all code, checkpoints, and details involved in training these models.

Model Details

The models released are the following:

| Benchmark | Embedding dimension | Diffusion timesteps |
| --- | --- | --- |
| RLBench (PerAct) | 120 | 100 |
| RLBench (GNFactor) | 120 | 100 |
| CALVIN | 192 | 25 |

Model Description

  • Developed by: Katerina Group at CMU
  • Model type: a diffusion model with 3D scene representations
  • License: The code and model are released under the MIT License
  • Contact: ngkanats@andrew.cmu.edu

Uses

Input format

3D Diffuser Actor takes the following inputs:

  1. RGB observations: a tensor of shape (batch_size, num_cameras, 3, H, W). The pixel values are in the range [0, 1].
  2. Point cloud observations: a tensor of shape (batch_size, num_cameras, 3, H, W).
  3. Instruction encodings: a tensor of shape (batch_size, max_instruction_length, C). In this code base, the embedding dimension C is set to 512.
  4. curr_gripper: a tensor of shape (batch_size, history_length, 7), where the last dimension holds the xyz position (3D) and the quaternion (4D).
  5. trajectory_mask: a tensor of shape (batch_size, trajectory_length), which is only used to indicate the length of each trajectory. To predict keyposes, set its shape to (batch_size, 1).
  6. gt_trajectory: a tensor of shape (batch_size, trajectory_length, 7), where the last dimension holds the xyz action (3D) and the quaternion (4D). This input is only used during training.
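
The shapes listed above can be collected into a small checker. This is a minimal sketch, not part of the released code: the helper names `expected_input_shapes` and `check_inputs` are hypothetical, the example sizes are arbitrary, and the checker only assumes each input exposes a `.shape` attribute, as PyTorch tensors do.

```python
from types import SimpleNamespace


def expected_input_shapes(batch_size, num_cameras, height, width,
                          max_instruction_length, history_length,
                          trajectory_length, instruction_dim=512):
    """Return the input shapes listed in this model card (hypothetical helper)."""
    return {
        "rgb_obs": (batch_size, num_cameras, 3, height, width),
        "pcd_obs": (batch_size, num_cameras, 3, height, width),
        "instruction": (batch_size, max_instruction_length, instruction_dim),
        "curr_gripper": (batch_size, history_length, 7),      # xyz (3) + quaternion (4)
        "trajectory_mask": (batch_size, trajectory_length),
        "gt_trajectory": (batch_size, trajectory_length, 7),  # training only
    }


def check_inputs(inputs, expected):
    """Return a list of shape mismatches; the list is empty if everything matches."""
    errors = []
    for name, shape in expected.items():
        if name not in inputs:
            errors.append(f"{name}: missing")
        elif tuple(inputs[name].shape) != shape:
            errors.append(f"{name}: got {tuple(inputs[name].shape)}, expected {shape}")
    return errors


# Keypose prediction: trajectory_length is 1, so the mask is (batch_size, 1).
# All sizes below are arbitrary example values, not the released configuration.
expected = expected_input_shapes(batch_size=2, num_cameras=4, height=256, width=256,
                                 max_instruction_length=53, history_length=3,
                                 trajectory_length=1)
# Stand-ins with a .shape attribute; real inputs would be torch tensors.
dummy = {name: SimpleNamespace(shape=shape) for name, shape in expected.items()}
print(check_inputs(dummy, expected))  # prints []
```

Running such a check before the forward pass tends to surface a swapped camera or channel axis earlier than an error raised deep inside the model.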

Output format

When run_inference=False, the model returns the diffusion loss. When run_inference=True, it returns a pose trajectory of shape (batch_size, trajectory_length, 8).
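
A predicted pose can be unpacked as in the sketch below. It rests on assumptions that this card does not confirm: that the first seven channels follow the same xyz-plus-quaternion layout as the 7-D inputs, and that the eighth channel is an extra value (plausibly the gripper open/close state). The helper `split_pose` is hypothetical and works on a plain Python sequence for one timestep.

```python
import math


def split_pose(pose):
    """Split one 8-D predicted pose into (xyz, unit quaternion, extra channel).

    ASSUMPTION: channels 0-2 are xyz and 3-6 the quaternion, as for the 7-D
    inputs; the meaning of the 8th channel is not documented in this card.
    """
    if len(pose) != 8:
        raise ValueError(f"expected 8 values, got {len(pose)}")
    xyz = list(pose[:3])
    quat = list(pose[3:7])
    norm = math.sqrt(sum(q * q for q in quat))
    if norm == 0.0:
        raise ValueError("zero-norm quaternion")
    quat = [q / norm for q in quat]  # renormalise to a unit quaternion
    return xyz, quat, pose[7]


xyz, quat, extra = split_pose([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 2.0, 1.0])
print(xyz, quat, extra)
```

Renormalising the quaternion is a defensive step: a network output is not guaranteed to have unit norm.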

Usage

For training, call the model's forward method with run_inference=False:

> loss = model.forward(gt_trajectory,
                       trajectory_mask,
                       rgb_obs,
                       pcd_obs,
                       instruction,
                       curr_gripper,
                       run_inference=False)

For evaluation, call the model's forward method with run_inference=True:

> fake_gt_trajectory = torch.full((1, trajectory_length, 7), 0).to(device)
> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
> trajectory = model.forward(fake_gt_trajectory,
                             trajectory_mask,
                             rgb_obs,
                             pcd_obs,
                             instruction,
                             curr_gripper,
                             run_inference=True)

Alternatively, call the compute_trajectory function directly:

> trajectory_mask = torch.full((1, trajectory_length), False).to(device)
> trajectory = model.compute_trajectory(trajectory_mask,
                                        rgb_obs,
                                        pcd_obs,
                                        instruction,
                                        curr_gripper)

Evaluation

Our model is trained and evaluated on RLBench simulation with the PerAct setup:

| RLBench (PerAct) | 3D Diffuser Actor | RVT |
| --- | --- | --- |
| average | 81.3 | 62.9 |
| open drawer | 89.6 | 71.2 |
| slide block | 97.6 | 81.6 |
| sweep to dustpan | 84.0 | 72.0 |
| meat off grill | 96.8 | 88.0 |
| turn tap | 99.2 | 93.6 |
| put in drawer | 96.0 | 88.0 |
| close jar | 96.0 | 52.0 |
| drag stick | 100.0 | 99.2 |
| stack blocks | 68.3 | 28.8 |
| screw bulbs | 82.4 | 48.0 |
| put in safe | 97.6 | 91.2 |
| place wine | 93.6 | 91.0 |
| put in cupboard | 85.6 | 49.6 |
| sort shape | 44.0 | 36.0 |
| push buttons | 98.4 | 100.0 |
| insert peg | 65.6 | 11.2 |
| stack cups | 47.2 | 26.4 |
| place cups | 24.0 | 4.0 |

Our model is trained and evaluated on RLBench simulation with the GNFactor setup:

| RLBench (GNFactor) | 3D Diffuser Actor | GNFactor |
| --- | --- | --- |
| average | 78.4 | 31.7 |
| open drawer | 89.3 | 76.0 |
| sweep to dustpan | 94.7 | 25.0 |
| close jar | 82.7 | 25.3 |
| meat off grill | 88.0 | 57.3 |
| turn tap | 80.0 | 50.7 |
| slide block | 92.0 | 20.0 |
| put in drawer | 77.3 | 0.0 |
| drag stick | 98.7 | 37.3 |
| push buttons | 69.3 | 18.7 |
| stack blocks | 12.0 | 4.0 |
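
As a sanity check on the GNFactor-setup numbers, the reported average for 3D Diffuser Actor can be recomputed from the per-task entries. The sketch below reads the sweep-to-dustpan entry as 94.7, the reading consistent with a percentage and with the reported average of 78.4. The dictionary is only a transcription of the table, not an output of the released evaluation code.

```python
# Per-task results for 3D Diffuser Actor (GNFactor setup), transcribed from the table.
# ASSUMPTION: the sweep-to-dustpan entry is read as 94.7.
success = {
    "open drawer": 89.3,
    "sweep to dustpan": 94.7,
    "close jar": 82.7,
    "meat off grill": 88.0,
    "turn tap": 80.0,
    "slide block": 92.0,
    "put in drawer": 77.3,
    "drag stick": 98.7,
    "push buttons": 69.3,
    "stack blocks": 12.0,
}

# Unweighted mean over the 10 tasks.
average = sum(success.values()) / len(success)
print(round(average, 1))  # prints 78.4
```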

Our model is trained and evaluated on CALVIN simulation (trained on environments A, B, and C and tested on D):

| CALVIN | 3D Diffuser Actor | GR-1 | SuSIE |
| --- | --- | --- | --- |
| task 1 | 92.2 | 85.4 | 87.0 |
| task 2 | 78.7 | 71.2 | 69.0 |
| task 3 | 63.9 | 59.6 | 49.0 |
| task 4 | 51.2 | 49.7 | 38.0 |
| task 5 | 41.2 | 40.1 | 26.0 |
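
The five rows can be condensed into a single number per method. The sketch below assumes, as is common for CALVIN's long-horizon evaluation but not stated in this card, that row "task k" is the percentage of episodes in which the first k chained instructions all succeed. Under that assumption, the expected number of consecutively completed tasks is the sum of the five rates divided by 100.

```python
# Success rates (%) per row "task 1" .. "task 5", transcribed from the table.
rates = {
    "3D Diffuser Actor": [92.2, 78.7, 63.9, 51.2, 41.2],
    "GR-1": [85.4, 71.2, 59.6, 49.7, 40.1],
    "SuSIE": [87.0, 69.0, 49.0, 38.0, 26.0],
}

# ASSUMPTION: row k is P(at least k tasks completed in a row), so the expected
# chain length is the sum of these probabilities.
avg_len = {name: round(sum(r) / 100, 2) for name, r in rates.items()}
print(avg_len)  # prints {'3D Diffuser Actor': 3.27, 'GR-1': 3.06, 'SuSIE': 2.69}
```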

Citation

BibTeX:

@article{,
  title={Action Diffusion with 3D Scene Representations},
  author={Ke, Tsung-Wei and Gkanatsios, Nikolaos and Fragkiadaki, Katerina},
  journal={Preprint},
  year={2024}
}

Model Card Contact

For errors in this model card, contact Nikos or Tsung-Wei, {ngkanats, tsungwek} at andrew dot cmu dot edu.