Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation. This problem demands the tight integration of perceptual modeling and motion synthesis, yet despite its importance it has been largely unexplored.
Most prior work has focused on mapping modalities such as speech, audio, or music to generate human motion. However, these approaches typically overlook the spatial features encoded in spatial audio signals and how those features influence human movement.
To bridge this gap and enable high-quality modeling of human motion in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset. SAM contains diverse, high-quality spatial audio paired with corresponding human motion data.
For benchmarking, we develop a simple yet effective diffusion-based generative framework for spatial audio-driven human motion generation, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective multimodal fusion mechanism. Once trained, the model can generate diverse and realistic human motions conditioned on varying spatial audio inputs.
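To make the conditioning idea concrete, the sketch below shows one minimal way a diffusion-style motion denoiser could be fused with spatial audio features. It is an illustrative assumption only: the module names, feature dimensions, and fusion-by-addition scheme are hypothetical and do not reflect the actual MOSPA architecture.

```python
# Minimal illustrative sketch (hypothetical, not the MOSPA implementation):
# a diffusion-style motion denoiser conditioned on per-frame spatial audio
# features through simple additive multimodal fusion.
import torch
import torch.nn as nn

class SpatialAudioMotionDenoiser(nn.Module):
    def __init__(self, motion_dim=263, audio_dim=128, latent_dim=256, n_layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)   # spatial audio features
        self.time_embed = nn.Sequential(                     # diffusion-step embedding
            nn.Linear(1, latent_dim), nn.SiLU(), nn.Linear(latent_dim, latent_dim)
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=4, dim_feedforward=512, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.out = nn.Linear(latent_dim, motion_dim)

    def forward(self, noisy_motion, audio_feats, t):
        # noisy_motion: (B, T, motion_dim), audio_feats: (B, T, audio_dim), t: (B, 1)
        x = self.motion_proj(noisy_motion)
        # Multimodal fusion: add per-frame audio tokens plus the timestep embedding.
        cond = self.audio_proj(audio_feats) + self.time_embed(t).unsqueeze(1)
        h = self.backbone(x + cond)
        return self.out(h)  # predicts the clean motion sequence

# Toy usage: denoise a random 60-frame motion clip given random audio features.
model = SpatialAudioMotionDenoiser()
motion = torch.randn(2, 60, 263)   # noisy motion sequences
audio = torch.randn(2, 60, 128)    # per-frame spatial audio features
t = torch.rand(2, 1)               # normalized diffusion timestep
pred_motion = model(motion, audio, t)
```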
Finally, we conduct a thorough investigation of the proposed dataset and perform extensive benchmarking experiments. Our approach achieves state-of-the-art performance on this task, demonstrating the effectiveness of both the dataset and the proposed framework.