ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

Human Demonstration with Audio-Visual Feedback
Robot Policy Rollout

Audio signals provide rich information about robot-object interactions and object properties through contact. This information can, perhaps surprisingly, ease the learning of contact-rich robot manipulation skills, especially when visual information alone is ambiguous or incomplete. However, the use of audio data in robot manipulation has so far been constrained to teleoperated demonstrations collected by attaching a microphone to either the robot or the object, which significantly limits its usefulness in robot learning pipelines. In this work, we introduce ManiWAV: an 'ear-in-hand' data collection device for collecting in-the-wild human demonstrations with synchronized audio and visual feedback, together with a corresponding policy interface that learns robot manipulation policies directly from the demonstrations. We demonstrate the capabilities of our system on four contact-rich manipulation tasks that require either passively sensing contact events and modes, or actively sensing object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments by learning from diverse in-the-wild human demonstrations.
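To make the policy interface concrete, here is a minimal sketch, in PyTorch/torchaudio, of how a synchronized (image, audio) observation could be encoded into a single feature vector for a policy. The log-mel front end, the tiny CNN stand-ins for the real encoders, and the attention-based fusion over modality tokens are illustrative assumptions, not the exact ManiWAV architecture.

```python
import torch
import torch.nn as nn
import torchaudio


class AudioVisualEncoder(nn.Module):
    """Encode a synchronized (image, audio) observation into one feature vector.

    All module choices (log-mel front end, tiny CNN encoders, attention
    pooling over the two modality tokens) are illustrative placeholders.
    """

    def __init__(self, feat_dim: int = 256, sample_rate: int = 16000):
        super().__init__()
        # Log-mel spectrogram front end for the contact-microphone stream.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Small CNNs standing in for the real vision / audio encoders.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # Attention over modality tokens rather than plain MLP concatenation.
        self.fuse = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, image: torch.Tensor, waveform: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); waveform: (B, T) mono contact-mic audio.
        spec = self.to_db(self.melspec(waveform)).unsqueeze(1)   # (B, 1, mel, frames)
        tokens = torch.stack([self.vision(image), self.audio(spec)], dim=1)
        fused, _ = self.fuse(tokens, tokens, tokens)             # (B, 2, feat_dim)
        return fused.mean(dim=1)  # one conditioning vector for the policy head
```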


Technical Summary Video (4 min)


Capability Experiments

(a) Wiping Whiteboard 🪧

The robot is tasked to wipe a shape (e.g., a heart or a square) drawn on a whiteboard. The robot can start in any initial configuration above the whiteboard, grasping an eraser parallel to the board. The main challenge of the task is that the robot needs to exert a reasonable amount of contact force on the whiteboard while moving the eraser along the shape; a rough audio proxy for this contact intensity is sketched below. Watch the video below for details of the task and ablations.
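As a rough illustration of why the contact microphone helps here, the short-time energy of the audio signal tracks how hard the eraser is scraping against the board. This is a standalone sketch; the learned policy consumes audio features end-to-end rather than a hand-crafted statistic like this one.

```python
import torch


def contact_intensity(waveform: torch.Tensor, frame: int = 1024) -> torch.Tensor:
    """Short-time RMS of a mono contact-mic clip as a rough proxy for how
    hard the eraser is pressing/scraping. The frame size is arbitrary.
    """
    usable = waveform[: waveform.numel() // frame * frame]
    return usable.view(-1, frame).pow(2).mean(dim=1).sqrt()  # one RMS per frame
```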

ManiWAV (unmute to hear the contact mic recording):

In distribution
Unseen shape (e.g. star)
Unseen table height
Unseen eraser

Baselines:

Vision only: Eraser fails to get into contact with the whiteboard and floats.
Vision only: Eraser does not get into contact with the whiteboard and floats.
MLP fusion: Policy terminates early, before the shape is completely wiped off.
No noise augmentation: Robot presses too hard on the whiteboard, causing the gripper to bend (see the augmentation sketch after this list).
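The 'no noise augmentation' ablation above refers to mixing background noise into the training audio so the policy does not overfit to quiet demonstration environments. Below is a minimal sketch, assuming mono waveforms at a shared sample rate; the uniformly sampled SNR range is an arbitrary choice, not a value from the paper.

```python
import torch


def augment_with_noise(clean: torch.Tensor, noise: torch.Tensor,
                       snr_db_range=(0.0, 20.0)) -> torch.Tensor:
    """Mix a random slice of background noise into a contact-mic clip at a
    random signal-to-noise ratio. Assumes `noise` is at least as long as
    `clean`; both are 1-D waveforms at the same sample rate.
    """
    start = torch.randint(0, noise.numel() - clean.numel() + 1, (1,)).item()
    noise = noise[start:start + clean.numel()]
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```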

Key Findings:


(b) Flipping Bagel 🥯

The robot is tasked to flip a bagel in a pan from facing down to facing up using a spatula. To perform this task successfully, the robot needs to sense and switch between different contact modes: precisely insert the spatula between the bagel and the pan, maintain contact while sliding, and start to tilt the spatula up once the bagel is in contact with the edge of the pan.

ManiWAV (unmute to hear the contact mic recording):

In distribution
In distribution
Unseen table height
Noise perturbation

Baselines:

Vision only: Spatula pokes on the side of the bagel.
Vision only: Robot loses contact with the bagel before it's flipped.
ResNet: Policy trained with a ResNet18 audio encoder fails due to spatula displacement.
MLP policy: Using an MLP instead of Diffusion Policy, the robot also sometimes loses contact with the bagel before it is flipped (see the denoising sketch after this list).
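The MLP-policy baseline above contrasts with the Diffusion Policy backbone, which predicts a chunk of future actions by iteratively denoising it conditioned on the fused audio-visual feature. The sketch below is deliberately simplified: the network sizes are made up, and the reverse step ignores the proper DDPM noise schedule that a real implementation would use.

```python
import torch
import torch.nn as nn


class NoisePredictor(nn.Module):
    """Tiny stand-in for a diffusion policy's denoising network: predicts the
    noise added to an action chunk, conditioned on the observation feature.
    """

    def __init__(self, action_dim: int = 7, horizon: int = 16, obs_dim: int = 256):
        super().__init__()
        self.action_dim, self.horizon = action_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + obs_dim + 1, 512), nn.Mish(),
            nn.Linear(512, action_dim * horizon))

    def forward(self, noisy_actions, obs_feat, t):
        x = torch.cat([noisy_actions.flatten(1), obs_feat, t.float()[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)


@torch.no_grad()
def sample_actions(model: NoisePredictor, obs_feat: torch.Tensor, steps: int = 50):
    """Very crude reverse-diffusion sampling of an action chunk; a real DDPM
    sampler would follow an alpha/beta noise schedule instead.
    """
    a = torch.randn(obs_feat.shape[0], model.horizon, model.action_dim)
    for t in reversed(range(steps)):
        tt = torch.full((obs_feat.shape[0],), t)
        a = a - model(a, obs_feat, tt) / steps  # remove a fraction of predicted noise
    return a
```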

In-the-Wild Generalization:

Different Bagels
Different Pans
Different Pans
Different Environments

Key Findings:


(c) Pouring 🎲

The robot is tasked to pick up the white cup and, if it is not empty, pour the dice inside it into the pink cup. After pouring, the robot needs to place the empty cup down at a designated location. The challenge of the task is that the robot cannot observe whether there are dice in the cup from the camera viewpoint, either before or after the pouring action, so it needs to leverage feedback from the vibrations of the dice inside the cup; a standalone illustration of this cue is sketched below. Watch the video below for details of the task and ablations.
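As a standalone illustration that this auditory cue is even detectable, one could threshold band-limited energy in the contact-mic signal. The band and threshold below are made-up values, and the actual policy learns the cue end-to-end rather than through a rule like this.

```python
import torch
import torchaudio


def cup_sounds_nonempty(waveform: torch.Tensor, sample_rate: int = 16000,
                        band=(500.0, 8000.0), threshold: float = 1e-4) -> bool:
    """Illustrative heuristic: dice rattling adds broadband energy to the
    contact-mic signal. Band and threshold are arbitrary, untuned values.
    """
    spec = torchaudio.transforms.Spectrogram(n_fft=1024)(waveform)  # (freq, time)
    freqs = torch.linspace(0, sample_rate / 2, spec.shape[0])
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return spec[mask].mean().item() > threshold
```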

Key Findings:


(d) Taping Wires with Velcro Tape ➰

The robot is tasked to choose the 'hook' tape from several tapes (either 'hook' or 'loop') and to strap wires by attaching the 'hook' tape to a 'loop' tape underneath the wires. The challenge of the task is that the difference between 'loop' and 'hook' tape is not observable with vision, but the subtle difference in surface material generates different sounds when the gripper finger slides against the tape; one hand-crafted view of this difference is sketched below. Watch the video below for details of the task and ablations.
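One way to see that such surface differences are audible is to reduce the sliding sound to a summary statistic such as the spectral centroid, which plausibly shifts between the two materials. This is purely illustrative: the learned policy does not use hand-crafted features, and the STFT parameters below are arbitrary choices.

```python
import torch
import torchaudio


def sliding_sound_centroid(waveform: torch.Tensor, sample_rate: int = 16000) -> float:
    """Mean spectral centroid (in Hz) of a mono sliding-contact recording."""
    centroid = torchaudio.functional.spectral_centroid(
        waveform, sample_rate, pad=0, window=torch.hann_window(1024),
        n_fft=1024, hop_length=512, win_length=1024)
    return centroid.mean().item()
```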

Key Findings:


More Results

Attention Map Visualization


Interestingly, we find that a policy co-trained with audio attends more to task-relevant regions (the shape of the drawing, or the free space inside the pan). In contrast, the vision-only policy often overfits to background structures as a shortcut for estimating contact (e.g., the edge of the whiteboard, the table, and room structures).
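For readers who want to reproduce this kind of visualization, a typical recipe is to take per-patch attention weights from a ViT-style encoder (for example, CLS-to-patch attention averaged over heads), reshape them to the patch grid, and overlay them on the input image. The 14x14 grid and the normalization below are assumptions for illustration, not ManiWAV specifics.

```python
import numpy as np
import matplotlib.pyplot as plt


def overlay_attention(image: np.ndarray, attn: np.ndarray, grid: int = 14):
    """Overlay patch-level attention weights on an (H, W, 3) image whose
    sides are divisible by `grid`. `attn` holds one weight per patch.
    """
    h, w = image.shape[:2]
    heat = attn.reshape(grid, grid)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
    heat = np.kron(heat, np.ones((h // grid, w // grid)))          # upsample to image size
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.4)  # semi-transparent heatmap on top
    plt.axis("off")
    plt.show()
```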