19 pages, including appendix. Scroll horizontally to flip through pages.
Handheld data collection systems, such as the Universal Manipulation Interface (UMI), enable scalable data collection across diverse environments but only capture observed actions rather than the desired actions executed by a robot controller. In contrast, teleoperation captures desired actions directly, but is prohibitively time-consuming to collect. We revisit this trade-off through the lens of label validity across task phases. We observe that handheld trajectories provide valid supervision in tolerant, free-space phases, but lack dynamic feasibility in contact-sensitive phases, where tracking observed trajectories at high stiffness produces large, unsafe contact forces. We study the interaction between these two supervision types for contact-rich manipulation and find that training policies that combine handheld data with a small number of targeted teleoperated demonstrations provides an efficient hybrid strategy. Specifically, rather than teleoperating the entire task, we only collect partial teleoperated demonstrations for task segments where base handheld policies fail. However, naively mixing handheld and teleoperated phase-specific data yields worse performance than training on handheld data alone. To address this mismatch between observed and desired supervision, we propose Bi-modal Routing for Imitation Data via Gated Experts (BRIDGE), a mixture of diffusion policy experts that routes between specialist task phase heads conditioned on the current robot state. Notably, our approach produces measurably lower end-effector forces under contact and improves success rates over handheld-only baselines by up to 36.7% across three contact-rich manipulation tasks.
Handheld devices record observed actions (where the end-effector actually went), while a robot controller is driven by desired actions (the reference it is commanded to track). In free-space, tolerant phases the two coincide, so observed handheld labels are valid supervision. In contact-sensitive phases they diverge: the desired position drives into the contact surface while the observed position is held above it. Tracking the observed trajectory at the stiffness needed to follow it then produces large, unsafe contact forces. This is why handheld-only labels fail precisely where contact matters — and why targeted teleoperation, which captures desired actions directly, is uniquely valid there.
We present a dual-mode data collection setup for contact-rich manipulation and introduce BRIDGE, a mixture-of-experts policy that jointly learns from handheld and teleoperated supervision.
UMI relies on offline monocular SLAM for trajectory reconstruction and ArUco-tag detection for gripper-width estimation. While this enables in-the-wild data collection, it requires offline processing for camera pose and gripper width — both incompatible with online teleoperation. We design Dual-Mode UMI (DM-UMI) to support both modes from a single device:
Collect a base dataset via handheld mode, identify failure modes, then collect a targeted support dataset via teleoperation. The base policy, support head, and router compose the BRIDGE model.
The base dataset is collected in handheld mode across diverse environments, capturing only observed actions. It offers broad coverage at low cost per trajectory.
The support dataset is collected in teleoperated mode, targeted at base-policy failure modes. It captures both observed and desired actions, isolating embodiment-specific effects (controller dynamics, kinematics, contact dynamics, grasp stability).
Our base policy extends Diffusion Policy. We retain the full set of
spatial patch tokens from a DINOv2 vision encoder (rather than only the
CLS token) to preserve spatial structure needed for contact-rich
reasoning. A PerceiverIO-style aggregation reduces \(V \in
\mathbb{R}^{B \times N \times D}\) tokens to a small set of learnable
queries \(Q \in \mathbb{R}^{B \times M \times D}\), \(M \ll N\), via
stacked cross-attention. State inputs (end-effector pose, gripper
width) are projected and cross-attended with the vision tokens to
produce the conditioning latent \(Z_\text{latent}\), which feeds a
temporal diffusion head trained with the standard diffusion loss on
the base dataset.
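The token-reduction step described above can be sketched as a single cross-attention pass in which \(M\) queries attend over \(N\) patch tokens. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses one head, omits the learned key/value/output projections, the stacking of layers, and the state cross-attention, and the sizes `N`, `M`, `D` and the function name `cross_attention_pool` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries, tokens):
    """Pool N tokens (N x D) into M outputs (M x D).

    Single-head scaled dot-product cross-attention. Learned projections
    are replaced by the identity (an assumption made for brevity).
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        # Similarity of this query to every token, scaled by sqrt(D).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # Each output is a convex combination of the token vectors.
        pooled.append([sum(w * v[j] for w, v in zip(weights, tokens))
                       for j in range(d)])
    return pooled


random.seed(0)
N, M, D = 16, 4, 8  # hypothetical sizes with M << N
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]
latents = cross_attention_pool(queries, tokens)
print(len(latents), len(latents[0]))  # 4 8
```

Because the output size depends only on the number of queries, the downstream diffusion head sees a fixed-size conditioning input regardless of how many patch tokens the encoder emits.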
With the base expert \(\pi_b\) and shared vision encoder frozen, we
train a support latent adapter \(\phi_s\) and head
\(\pi_s\) on the support dataset. The action target is
the desired trajectory, \(\hat{a}^s_{t:t+H} = \pi_s(c^b_t)\). The
support expert is optimized independently with its own diffusion loss.
An MLP gate \(G_\psi(z_t)\) routes observations to the appropriate
expert. To label samples, we extract intermediate latents
\(Z_\text{latent}\) from the base policy on both datasets, filter
overlapping base latents using a \(k\)-NN classifier
(\(k=16\), distance \(\epsilon=0.8\)), then define a support score:
$$\sigma_{-} = \cos(z, z_b),\ z_b \in B_b, \qquad \sigma_{+} = \cos(z, z_s),\ z_s \in B_s$$ $$\rho(z) = \mathbb{I}\left[\sigma_{+} - \sigma_{-} > \eta\right]$$
We distill the \(k\)-NN classifier into \(G_\psi\) with binary cross-entropy on \(\rho(z_t)\).
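The support score can be illustrated with a small sketch. The text does not say how similarities are aggregated over each latent bank, so this example assumes the maximum cosine similarity (the nearest neighbor) in each bank. The function names, the toy 2-D latents, and the value of `eta` are all hypothetical, and the \(k\)-NN overlap filtering step is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def support_label(z, base_bank, support_bank, eta):
    """Return 1 if z looks more like a support latent than a base latent.

    Assumption: each sigma is the max cosine similarity over its bank.
    """
    sigma_minus = max(cosine(z, z_b) for z_b in base_bank)
    sigma_plus = max(cosine(z, z_s) for z_s in support_bank)
    return int(sigma_plus - sigma_minus > eta)


# Toy 2-D latents, illustrative only.
base_bank = [[1.0, 0.0], [0.9, 0.1]]
support_bank = [[0.0, 1.0], [0.1, 0.9]]
eta = 0.2  # hypothetical threshold

print(support_label([0.05, 1.0], base_bank, support_bank, eta))  # 1
print(support_label([1.0, 0.05], base_bank, support_bank, eta))  # 0
```

Labels produced this way would serve as the binary targets for the MLP gate, which can then be evaluated at inference without storing the latent banks.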
At inference, each observation \(o_t\) is encoded into a shared \(Z_\text{vision}\) used to produce the router latent. The router emits gating predictions \(g_{t:t+H}\). Given threshold \(\eta\), we hard-switch between experts:
$$\hat{a}^b_{t+j} = \pi_b(\phi_b(Z_\text{vision})), \qquad \hat{a}^s_{t+j} = \pi_s(\phi_s(Z_\text{vision}))$$ $$m_{t+j} = \mathbb{I}[g_{t+j} > \eta], \qquad \hat{a}_{t+j} = (1 - m_{t+j})\hat{a}^b_{t+j} + m_{t+j}\,\hat{a}^s_{t+j}$$
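The hard switch above amounts to a per-step selection between the two experts' predicted actions. The sketch below mirrors the mask-and-blend form of the equation. The function name `hard_switch`, the gate values, the 2-D actions, and the threshold are hypothetical and chosen only for illustration.

```python
def hard_switch(gates, base_actions, support_actions, eta):
    """Compose an action chunk by switching experts per step.

    m = 1 selects the support expert, m = 0 the base expert.
    """
    chunk = []
    for g, a_b, a_s in zip(gates, base_actions, support_actions):
        m = 1 if g > eta else 0
        chunk.append([(1 - m) * b + m * s for b, s in zip(a_b, a_s)])
    return chunk


# Hypothetical horizon of four steps with 2-D actions.
gates = [0.1, 0.4, 0.7, 0.9]
base_actions = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]]
support_actions = [[0.0, -1.0], [0.1, -1.0], [0.2, -1.0], [0.3, -1.0]]

actions = hard_switch(gates, base_actions, support_actions, eta=0.5)
for step, action in enumerate(actions):
    print(step, action)
```

With these toy inputs the first two steps come from the base expert and the last two from the support expert. Because the mask is binary, the output is never an average of the two experts' actions.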
Unlike residual learning — which requires every expert to model the entire task — the hard switch lets each expert specialize locally.
We evaluate on three precise, contact-rich tasks on a Franka FR3 arm running a Cartesian impedance controller. Each concentrates difficulty in a short contact-sensitive phase bracketed by free-space motion.
NIST Pulley Routing
Grasp a deformable O-ring and route it around the pulleys while maintaining constant tension (NIST Assembly Task Board #2).
Pipe Insertion
Grasp a pipe and insert it into a tight-tolerance opening, requiring reliable grasping, 6-DoF alignment, and force-aware insertion.
Battery Insertion
Pick up an AA battery and seat it in a spring-loaded compartment, keeping the spring compressed while the battery is seated.
BRIDGE consistently outperforms handheld and naive-mixing baselines and
recovers much of the gap to a full-teleoperation upper bound —
even though its sparse support data covers under 15% of the
end-effector path length. Naively mixing observed and desired labels in
a single head degrades performance, often below the
handheld-only policy, motivating state-gated routing instead.
| Method | NIST Pulley | Pipe Insertion | Battery Insertion |
|---|---|---|---|
| Base Policy (Handheld) | 44.0% | 13.3% | 10.0% |
| Naive Mix | 0.0% | 6.7% | 0.0% |
| BRIDGE (Ours) | 76.0% | 50.0% | 33.3% |
| Base Policy (Teleoperated)† | 84.0% | 63.3% | 40.0% |
†Full-task teleoperation upper bound, collected under an approximately time-matched data budget. Each task is evaluated over 25–30 rollouts across shifted fixture positions.
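As a quick check on the reported numbers, the short script below computes BRIDGE's absolute gain over the handheld base policy and the fraction of the handheld-to-teleoperation gap it recovers, using only the success rates from the results table. The "gap recovered" ratio is a derived quantity introduced here for illustration, not a metric reported by the authors.

```python
# Success rates (%) copied from the results table.
results = {
    "NIST Pulley": {"handheld": 44.0, "bridge": 76.0, "teleop": 84.0},
    "Pipe Insertion": {"handheld": 13.3, "bridge": 50.0, "teleop": 63.3},
    "Battery Insertion": {"handheld": 10.0, "bridge": 33.3, "teleop": 40.0},
}

gains = {}
recovered = {}
for task, r in results.items():
    # Absolute improvement in percentage points over the handheld policy.
    gains[task] = r["bridge"] - r["handheld"]
    # Share of the handheld-to-teleop gap closed by BRIDGE.
    recovered[task] = gains[task] / (r["teleop"] - r["handheld"])
    print(f"{task}: +{gains[task]:.1f} points, "
          f"{recovered[task]:.0%} of gap recovered")
```

The largest gain, 36.7 percentage points on pipe insertion, matches the headline improvement, and the recovered fraction is roughly 73% to 80% across the three tasks.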
| Task | Base # | Support # | Teleop # | Support Temporal % | Support Dist. % |
|---|---|---|---|---|---|
| NIST Pulley Routing | 201 | 50 | 60 | 37.8% | 7.4% |
| Pipe Insertion | 100 | 81 | 60 | 24.5% | 3.4% |
| Battery Insertion | 290 | 59 | 100 | 61.1% | 14.9% |
The support data spans a large temporal fraction (owing to the slower teleoperation rate) yet remains spatially sparse, covering only a small fraction of the end-effector path length.
On a held-out teleoperated set with manually labeled support phases
(pipe insertion), the MLP router reaches 99.0% recall
and 69.0% precision, favoring early
support activation over missed handoffs. A t-SNE of the
router latents shows clear separation between base and
support states, with false positives concentrated near the
support-manifold boundary (transition states) rather than unrelated
base states — consistent with the nearest-neighbor
design.
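For reference, the recall and precision figures follow the standard binary-classification definitions, sketched below on toy per-timestep labels. The label sequences are illustrative only and are not data from the evaluation. They are constructed to show the same qualitative pattern: a router that activates the support expert slightly early has perfect recall but reduced precision.

```python
def precision_recall(predicted, actual):
    """Precision and recall for binary labels (1 = support phase)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: the router switches on two steps before the true handoff.
actual = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f}")
# precision=0.667 recall=1.000
```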
@misc{surendran2026bridge,
title = {Bridging Handheld and Teleoperated Supervision for Contact-Rich Manipulation via State-Gated Experts},
author = {Surendran, Vidullan and Peri, Neehar and Watkins, David},
year = {2026},
}