Towards Immersive Human-X Interaction: A Real-Time
Framework for Physically Plausible Motion Synthesis

Kaiyang Ji1, Ye Shi1,2, Zichen Jin1, Kangyi Chen1, Lan Xu1,2, Yuexin Ma1,2, Jingyi Yu1,2, Jingya Wang1,2,*
1ShanghaiTech University, 2Shanghai Engineering Research Center of Intelligent Vision and Imaging,
*Corresponding Authors
ICCV 2025 Highlight

Figure 1. We propose Human-X, the first framework designed to enable latency-free interaction between humans and diverse entities, spanning human-avatar, human-humanoid, and human-robot interaction.

Abstract

Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to the interaction partner's movements while avoiding artifacts such as foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
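
To make the two-stage design concrete, the minimal PyTorch sketch below illustrates an auto-regressive reaction diffusion rollout of the kind described above: at each control step the planner conditions on the actor's recent motion and the reactor's own history, denoises a short future window of reactor poses, commits only the first few frames, and then re-plans. The network, noise schedule, history length, horizon, and pose dimensionality are all illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the authors' released code) of an auto-regressive
# reaction diffusion rollout. All module names, sizes, and the noise
# schedule below are illustrative placeholders.
import torch
import torch.nn as nn


class ReactionDenoiser(nn.Module):
    """Toy stand-in for the reaction diffusion network (epsilon-prediction)."""

    def __init__(self, pose_dim=69, hist_len=16, horizon=8, hidden=256):
        super().__init__()
        # Inputs: noisy future window, actor + reactor history, diffusion timestep.
        in_dim = horizon * pose_dim + 2 * hist_len * pose_dim + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, horizon * pose_dim),
        )
        self.horizon, self.pose_dim = horizon, pose_dim

    def forward(self, x_t, t, actor_hist, reactor_hist):
        feats = torch.cat(
            [x_t.flatten(1), actor_hist.flatten(1),
             reactor_hist.flatten(1), t.float()[:, None]], dim=1)
        return self.net(feats).view(-1, self.horizon, self.pose_dim)


@torch.no_grad()
def plan_reaction(model, actor_hist, reactor_hist, n_steps=10):
    """Simplified DDPM-style reverse loop over a short reaction window."""
    B = actor_hist.shape[0]
    alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, n_steps), dim=0)
    x = torch.randn(B, model.horizon, model.pose_dim)
    for t in reversed(range(n_steps)):
        eps = model(x, torch.full((B,), t), actor_hist, reactor_hist)
        x0 = (x - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        if t > 0:  # re-noise the x0 estimate to level t-1 and continue
            x = alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * torch.randn_like(x)
        else:
            x = x0
    return x  # predicted reactor poses for the next `horizon` frames


# Auto-regressive rollout: plan a window, commit only its first frames,
# slide the reactor history, and re-plan with the newest actor observations.
model = ReactionDenoiser()
actor_hist = torch.zeros(1, 16, 69)    # would be refreshed from live motion capture
reactor_hist = torch.zeros(1, 16, 69)  # the reactor's own generated past
for _ in range(5):
    window = plan_reaction(model, actor_hist, reactor_hist)
    commit = window[:, :4]
    reactor_hist = torch.cat([reactor_hist[:, 4:], commit], dim=1)
```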

Method


Figure 2. Overview of our immersive real-time interaction synthesis pipeline: (a) Actor Motion Capture: A human actor's movements are recorded at 30 fps by an RGB-D camera and translated into 3D poses, which are then retargeted to a humanoid character. (b) Realistic Reactor Motion Generation: An auto-regressive diffusion model, guided by optional text prompts (e.g., “Dancing is what to do”), generates plausible reaction motions. These motions are tracked by an actor-aware controller, which uses proprioception signals to ensure realistic, synchronized interactions. (c) Real-time VR Interface: The generated and tracked motions are rendered in Isaac Gym, providing both a third-person view and a binocular VR view.
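
As a rough illustration of the tracking component in stage (b), the sketch below shows one plausible way an actor-aware controller could assemble its per-step observation: the reactor's proprioception, the target reaction frame produced by the diffusion planner, and the observed (retargeted) actor pose. Field names and dimensions are assumptions for illustration; the paper's exact observation layout and policy architecture are not reproduced here.

```python
# Hedged sketch of how an actor-aware tracking controller could assemble its
# per-step observation. Field names and dimensions are illustrative
# assumptions, not the paper's exact design.
import numpy as np


def build_observation(proprio, ref_reaction_frame, actor_state):
    """Stack reactor proprioception, the planner's target reaction frame,
    and the observed actor state into a single policy input vector."""
    return np.concatenate([
        proprio["joint_pos"],    # reactor joint positions
        proprio["joint_vel"],    # reactor joint velocities
        proprio["root_state"],   # root orientation plus linear/angular velocity
        ref_reaction_frame,      # target pose from the reaction diffusion planner
        actor_state,             # partner's retargeted pose (the "actor-aware" input)
    ])


# Toy call with placeholder dimensions.
proprio = {
    "joint_pos": np.zeros(23),
    "joint_vel": np.zeros(23),
    "root_state": np.zeros(13),
}
obs = build_observation(proprio, np.zeros(69), np.zeros(69))
print(obs.shape)  # (197,)
```

In the full system, the reinforcement-learning-trained policy consumes such an observation and outputs low-level control targets that are stepped in the physics simulator (Isaac Gym in Figure 2), which is how artifacts such as foot sliding and penetration are suppressed at the physics level.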

Visualization Results


Figure 3. Visualization of human reaction synthesis results. Blue denotes actors and orange denotes reactors. Compared to CAMDM (top row), Human-X (bottom row) achieves more complete hand contact in tasks such as face-hitting and handshaking. Additionally, its foot movement appears more natural, as highlighted in the red and green circles.


Figure 4. Visualization results on Human-Robot Interaction. The robot (black skeleton) and human (orange mesh) perform a handshake on a flat plane, from arm extension and palm contact through to shake completion, illustrating our method’s spatial coordination and motion coherence.