

Phys2Real is a real-to-sim-to-real pipeline for robotic manipulation that combines VLM-based physical parameter estimation with interaction-based adaptation through uncertainty-aware fusion. It comprises three stages: high-fidelity geometric reconstruction with 3D Gaussian splatting, VLM-inferred prior distributions over physical parameters, and online physical parameter estimation from interaction data.
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% for the challenging top-weighted T-block, and 15% faster task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success.
We fuse VLM priors with interactive online adaptation to estimate physical parameters and improve sim-to-real manipulation.
Uncertainty-aware fusion of VLM priors and interaction yields near-privileged sim-to-real performance on planar pushing.
We reconstruct simulation-ready assets from real scenes: segmented 3D Gaussian splats are converted into watertight meshes. This "real-to-sim" step lets policies train on objects that capture geometry relevant to dynamics.
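As a sketch of one way this conversion could be done, the snippet below runs Poisson surface reconstruction (via Open3D) on a point cloud exported from the segmented splat centers to obtain a closed mesh. The file names and reconstruction settings are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (assumed workflow, not the paper's exact procedure):
# segmented Gaussian-splat centers exported as a point cloud -> watertight mesh.
import numpy as np
import open3d as o3d

# Hypothetical exported point cloud of splat centers for the T-block.
pcd = o3d.io.read_point_cloud("tblock_splat_centers.ply")

# Poisson reconstruction needs consistently oriented normals.
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30)
)
pcd.orient_normals_consistent_tangent_plane(30)

# Poisson surface reconstruction yields a closed (watertight) surface.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9
)

# Trim low-support vertices that Poisson extrapolated far from the data.
dens = np.asarray(densities)
mesh.remove_vertices_by_mask(dens < np.quantile(dens, 0.02))

o3d.io.write_triangle_mesh("tblock_mesh.obj", mesh)
```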
We train policies conditioned on interpretable physical parameters (e.g., the object's CoM). Training occurs in three phases: (1) privileged training with accurate parameters, (1.5) noise-aware fine-tuning for robustness, and (2) online adaptation via an uncertainty-aware ensemble that updates the parameter belief during execution.
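As a sketch of how the phase-2 estimator might look, the code below builds a small ensemble that maps a state-action history to a physical-parameter estimate (here a 2D CoM offset) and uses the spread across members as epistemic uncertainty. The architecture, history length, and feature dimensions are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of an ensemble parameter estimator (assumed architecture):
# each member maps a flattened state-action history to a parameter estimate;
# disagreement across members serves as epistemic uncertainty.
import torch
import torch.nn as nn

class ParamEstimator(nn.Module):
    def __init__(self, history_dim: int, param_dim: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, param_dim),
        )

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        return self.net(history)

def ensemble_estimate(members, history):
    """Mean estimate and per-dimension std (epistemic uncertainty) across members."""
    preds = torch.stack([m(history) for m in members])  # (E, batch, param_dim)
    return preds.mean(dim=0), preds.std(dim=0)

# Example: 5 members over a flattened 50-step, 16-dim state-action history
# (dimensions are assumptions).
members = [ParamEstimator(history_dim=50 * 16) for _ in range(5)]
history = torch.randn(1, 50 * 16)
with torch.no_grad():
    com_mean, com_std = ensemble_estimate(members, history)
print(com_mean, com_std)
```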
A vision-language model provides a prior distribution over the parameters (mean + uncertainty). During interaction, an RMA-style estimator refines this belief. We fuse them using their uncertainties so the policy sees an interpretable estimate that improves over time.
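One standard way to realize such a fusion, shown below, is to treat the VLM prior and the online estimate as Gaussians and combine them by inverse-variance (precision) weighting, so the lower-uncertainty source dominates. The rule and the numbers are illustrative; the paper's exact fusion may differ.

```python
# Minimal sketch of uncertainty-aware fusion: combine the VLM prior and the
# online (ensemble) estimate by inverse-variance weighting. This is a standard
# Gaussian fusion rule assumed here, not taken verbatim from the paper.
import numpy as np

def fuse_gaussians(mu_vlm, std_vlm, mu_online, std_online):
    """Precision-weighted fusion of two Gaussian parameter beliefs."""
    prec_vlm = 1.0 / np.square(std_vlm)
    prec_online = 1.0 / np.square(std_online)
    var_fused = 1.0 / (prec_vlm + prec_online)
    mu_fused = var_fused * (prec_vlm * mu_vlm + prec_online * mu_online)
    return mu_fused, np.sqrt(var_fused)

# Example: VLM prior on a CoM offset (meters) vs. a confident online estimate.
mu, std = fuse_gaussians(
    mu_vlm=np.array([0.00, 0.03]), std_vlm=np.array([0.02, 0.02]),
    mu_online=np.array([0.01, 0.05]), std_online=np.array([0.005, 0.005]),
)
print(mu, std)  # fused estimate leans toward the lower-variance online term
```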
Condition the policy solely on VLM-estimated parameters, without interaction, to evaluate whether priors alone enable effective control.
Remove the VLM prior and use only interactive online adaptation.
Apply the full pipeline to objects reconstructed purely from images, with no pre-existing CAD mesh (e.g., the hammer).
Q1–Q3 T-block pushing: Phys2Real consistently outperforms the domain randomization (DR) and diffusion policy baselines. With only VLM priors and interaction, it approaches privileged performance while remaining interpretable.
The CDF gives the fraction of trials with position error below each threshold; curves that are higher and further to the left are better. For example, a value of 0.8 at 2 cm means 80% of trials ended within 2 cm.
Top-weighted T-block: 24% (DR) → 57% (Phys2Real). Bottom-weighted: 79% → 100%.
Q4 Hammer (reconstructed object): both methods succeed, but Phys2Real is ~16% faster to completion.
Phys2Real (faster completion)
Domain Randomization (DR)
Vision provides a prior; interaction reduces uncertainty for robust sim-to-real control.
Early in each episode, the fused estimate relies on the VLM prior. As interaction accumulates and more contact information becomes available, the ensemble’s epistemic uncertainty shrinks, and the fused estimate converges toward the true physical parameters. This real-time adaptation aligns the policy’s internal physics estimates with the true environment, allowing more accurate and precise control.
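As a toy illustration of this behavior, the snippet below fuses a fixed VLM prior with an online estimate whose ensemble standard deviation shrinks over an episode (a synthetic schedule): early fused estimates stay near the prior, and later ones converge toward the online estimate. All numbers are made up for illustration.

```python
# Toy illustration (synthetic numbers): as ensemble uncertainty shrinks,
# inverse-variance fusion shifts the fused CoM estimate from the VLM prior
# toward the online estimate.
import numpy as np

mu_vlm, std_vlm = 0.00, 0.02       # VLM prior over CoM offset (m)
mu_online, true_com = 0.045, 0.05  # online estimate near the true value

for t, std_online in enumerate([0.05, 0.02, 0.01, 0.005, 0.002]):
    prec = 1.0 / std_vlm**2 + 1.0 / std_online**2
    mu_fused = (mu_vlm / std_vlm**2 + mu_online / std_online**2) / prec
    print(f"step {t}: fused CoM estimate = {mu_fused:.4f} m "
          f"(true {true_com:.3f} m)")
# Early steps stay close to the prior (0.00); later steps approach 0.045.
```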
Phys2Real shows that combining foundation model priors with interactive online adaptation yields physically grounded, uncertainty-aware policies that transfer more reliably to the real world. By treating large vision models as sources of physical intuition and refining those estimates through real interaction, we move toward robotic systems that can understand, predict, and adapt in diverse environments.
@article{wang2025phys2real,
title = {Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation},
author = {Wang, Maggie and Tian, Stephen and Swann, Aiden and Shorinwa, Ola and Wu, Jiajun and Schwager, Mac},
journal = {arXiv preprint arXiv:2510.11689},
year = {2025},
url = {https://arxiv.org/abs/2510.11689}
}