"Make a short line out of these objects in the center of the table"
Robots operating in human environments must be able to rearrange objects into semantically meaningful configurations, even when those objects have not been seen before. We focus on the problem of building physically valid structures without step-by-step instructions.
We propose StructDiffusion, which combines a diffusion model with an object-centric transformer to construct structures from the objects in a single RGB-D image, based on high-level language goals such as "set the table" or "make a line".
StructDiffusion improves the success rate of assembling physically valid structures from unseen objects by an average of 16% over an existing multi-modal transformer model, while allowing a single multi-task model to produce a wider range of structures. We present experiments on held-out objects in both simulation and real-world rearrangement tasks.
Building semantically meaningful structures requires satisfying two different sets of constraints: (1) objects must be placed in the correct positions to satisfy the desired spatio-semantic relations; (2) objects must not collide, and the resulting arrangement must be structurally sound.
We propose StructDiffusion, a single framework that jointly optimizes over these two, sometimes conflicting, constraints.
We use unknown-object instance segmentation to decompose the scene into objects, following prior work (e.g., [1], [2], [3]). A multi-modal transformer then combines word tokens with object encodings from a Point Cloud Transformer to predict 6-DoF goal poses. These predictions are refined iteratively through the reverse diffusion process, and candidate arrangements are ranked by a discriminator trained to recognize unrealistic samples.
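To make the sampling-and-selection step concrete, here is a minimal sketch (not the released StructDiffusion code) of discriminator-based re-ranking: several candidate arrangements are drawn from the generative model, and the one the discriminator scores as most physically plausible is kept. The sample_fn and score_fn callables are hypothetical stand-ins for the diffusion sampler and the discriminator.

```python
# Minimal sketch: sample multiple candidate arrangements and keep the one a
# learned discriminator scores as most physically plausible.
# sample_fn and score_fn are hypothetical stand-ins, not StructDiffusion's API.
import torch

def select_best_arrangement(sample_fn, score_fn, n_samples=16):
    candidates = [sample_fn() for _ in range(n_samples)]     # candidate sets of 6-DoF goal poses
    scores = torch.stack([score_fn(c) for c in candidates])  # higher = more realistic arrangement
    return candidates[int(scores.argmax())]
```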
Our diffusion model is integrated with a transformer that maintains an individual attention stream for each object. This object-centric approach focuses learning on the interactions between objects based on their geometric features, as well as on grounding abstract concepts (e.g., "large," "circle," "top") in the spatio-semantic relations between objects.
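As an illustration of this object-centric design, the sketch below shows one way such a multi-modal denoising transformer could be assembled in PyTorch: each object contributes a token that fuses its point-cloud feature with its (noisy) goal pose, and these tokens attend to the language tokens and to each other. The class name, dimensions, layer counts, and the 9-D pose parameterization (3-D translation plus a 6-D rotation representation) are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative object-centric denoising transformer; names and sizes are assumptions.
import torch
import torch.nn as nn

class ObjectCentricPoseDenoiser(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=6, vocab_size=1000, pose_dim=9):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)   # language tokens
        self.obj_proj = nn.Linear(128, d_model)             # per-object point-cloud features (e.g., from a PCT encoder)
        self.pose_proj = nn.Linear(pose_dim, d_model)       # noisy goal poses
        self.time_emb = nn.Embedding(1000, d_model)         # diffusion timestep
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.noise_head = nn.Linear(d_model, pose_dim)      # per-object noise prediction

    def forward(self, word_ids, obj_feats, noisy_poses, t):
        # word_ids: (B, L), obj_feats: (B, N, 128), noisy_poses: (B, N, pose_dim), t: (B,)
        words = self.word_emb(word_ids)
        # One token per object: geometric feature fused with its (noisy) goal pose.
        objs = self.obj_proj(obj_feats) + self.pose_proj(noisy_poses)
        time = self.time_emb(t).unsqueeze(1)                 # (B, 1, d_model)
        seq = torch.cat([time, words, objs], dim=1)
        out = self.encoder(seq)
        obj_out = out[:, -objs.shape[1]:]                    # keep only the object tokens
        return self.noise_head(obj_out)                      # predicted noise for each object pose
```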
We start the reverse diffusion process from random noise and jointly denoise the goal poses of all objects in the scene. Reasoning jointly about object-object interactions in this way generalizes well and outperforms directly predicting goal poses from the multi-modal inputs.
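This joint prediction can be read as a standard DDPM-style reverse sampling loop over the stacked object poses. The sketch below reuses the hypothetical denoiser from the previous block; the noise schedule and step count are illustrative, and the rotation component would in practice need to be projected back to a valid parameterization.

```python
# Standard DDPM-style reverse sampling over all object poses jointly.
# Schedule and step count are illustrative assumptions.
import torch

@torch.no_grad()
def sample_goal_poses(model, word_ids, obj_feats, pose_dim=9, T=200):
    B, N = obj_feats.shape[:2]
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure noise over all object poses and denoise them jointly,
    # so every step can account for object-object interactions.
    poses = torch.randn(B, N, pose_dim)
    for t in reversed(range(T)):
        t_batch = torch.full((B,), t, dtype=torch.long)
        eps = model(word_ids, obj_feats, poses, t_batch)     # predicted noise
        mean = (poses - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(poses) if t > 0 else torch.zeros_like(poses)
        poses = mean + torch.sqrt(betas[t]) * noise
    return poses                                             # (B, N, pose_dim) goal poses
```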
"Make a short line out of these objects in the center of the table"
"Make a small circle out of these objects in the center of the table"
"Make a short line out of these objects in the center of the table"
"Make a tower out of these objects in the center of the table"