A new AI tool from Alibaba’s Tongyi Lab promises to revolutionize immersive experiences by generating spatial audio directly from 360° videos.
The world of virtual reality and immersive entertainment is constantly evolving, demanding increasingly realistic and engaging experiences. A key component of this realism is spatial audio, which accurately replicates how sound interacts with our environment, providing a sense of direction and depth. Now, Alibaba’s Tongyi Lab has stepped into the arena with OmniAudio, a groundbreaking AI model designed to generate spatial audio (specifically, First-Order Ambisonics or FOA) from 360° videos.
What is OmniAudio?
OmniAudio is a cutting-edge technology developed by the speech team at Alibaba’s Tongyi Lab. It tackles the challenge of creating immersive audio experiences by directly generating FOA audio from 360° videos. This is a significant advancement because traditional video-to-audio generation techniques typically produce non-spatial audio, which lacks the directional information crucial for a truly immersive experience.
How does it work?
The development of OmniAudio relied on the creation of a large dataset called Sphere360. This dataset comprises over 103,000 video clips, spanning 288 types of audio events and totaling 288 hours of footage. This rich resource provided the foundation for training the OmniAudio model.
The training process is divided into two key stages:
- Self-Supervised Coarse-to-Fine Flow Matching Pre-training: This initial stage leverages a large volume of non-spatial audio resources to enable self-supervised learning, allowing the model to learn fundamental audio characteristics.
- Supervised Fine-Tuning with Dual-Branch Video Representation: This stage refines the model’s ability to represent sound source direction through supervised learning, utilizing a dual-branch video representation approach.
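The flow-matching objective used in the pre-training stage can be illustrated with a toy numpy sketch. The shapes and the straight-line (rectified) interpolation path are assumptions for illustration; the source does not detail OmniAudio's exact formulation. The core idea is to regress a velocity field that transports noise samples toward data samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: x0 ~ noise, x1 ~ audio latents (shapes are illustrative).
x0 = rng.standard_normal((4, 16))   # noise sample
x1 = rng.standard_normal((4, 16))   # data (audio latent) sample

# Flow matching regresses a velocity field along an interpolation path
# between noise and data; a straight-line path is a common choice:
t = rng.uniform(size=(4, 1))        # per-example time in [0, 1]
x_t = (1.0 - t) * x0 + t * x1       # point on the path at time t
v_target = x1 - x0                  # target velocity along that path

def fm_loss(v_pred, v_target):
    """Mean-squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

# A model v_theta(x_t, t) would be trained to minimize this loss;
# predicting the exact target velocity drives it to zero.
print(fm_loss(v_target, v_target))  # 0.0
```

In the self-supervised stage this loss can be computed on abundant non-spatial audio, so the model learns general audio structure before the supervised fine-tuning stage adds directional conditioning from the video.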
Key Features and Benefits:
- Spatial Audio Generation: OmniAudio directly generates FOA audio from 360° videos. FOA is a standard 3D spatial audio format that captures the directionality of sound, enabling realistic 3D audio reproduction. It uses four channels (W, X, Y, Z): the W channel captures overall sound pressure, while the X, Y, and Z channels capture the front/back, left/right, and up/down directional components, respectively.
- Enhanced Immersive Experience: By generating spatial audio, OmniAudio significantly enhances the immersive experience in virtual reality and other immersive entertainment applications. It addresses the limitations of traditional video-to-audio generation techniques that fail to capture the 3D sound localization required for true immersion.
- Accurate Sound Localization: The FOA format ensures accurate sound localization, even as the listener’s head rotates, further enhancing the realism of the audio experience.
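The FOA channel layout and its rotation behavior can be sketched in a few lines of numpy. This is a minimal illustration of standard first-order ambisonics, not OmniAudio's internals; unit channel gains are assumed (real encoders apply a normalization convention such as SN3D or FuMa):

```python
import numpy as np

def foa_encode(mono, azimuth, elevation):
    """Encode a mono signal into four FOA channels (W, X, Y, Z).

    azimuth: radians, 0 = straight ahead, positive to the left
    elevation: radians, 0 = horizon, positive up
    """
    w = mono * 1.0                                  # omnidirectional pressure
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front/back component
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left/right component
    z = mono * np.sin(elevation)                    # up/down component
    return np.stack([w, x, y, z])

def rotate_yaw(foa, theta):
    """Counter-rotate the sound field for a listener head-yaw of theta.

    W and Z are invariant under yaw; X and Y mix by a 2-D rotation,
    which is why FOA keeps sources anchored as the head turns.
    """
    w, x, y, z = foa
    x2 = x * np.cos(theta) + y * np.sin(theta)
    y2 = -x * np.sin(theta) + y * np.cos(theta)
    return np.stack([w, x2, y2, z])
```

For example, a source encoded 60° to the left ends up directly in front after the listener turns their head 60° to the left: `rotate_yaw(foa_encode(s, np.pi/3, 0.0), np.pi/3)` matches `foa_encode(s, 0.0, 0.0)`.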
Implications for the Future:
OmniAudio has the potential to revolutionize various fields, including:
- Virtual Reality (VR): Creating more realistic and engaging VR experiences.
- Augmented Reality (AR): Enhancing AR applications with spatially accurate sound.
- Gaming: Providing a more immersive and realistic gaming experience.
- Film and Television: Creating more captivating and believable audio environments for 360° video content.
Conclusion:
Alibaba’s OmniAudio represents a significant step forward in the field of spatial audio generation. By leveraging a massive dataset and a sophisticated training process, Tongyi Lab has created a powerful tool that promises to transform immersive experiences across a wide range of applications. As VR, AR, and 360° video content continue to grow in popularity, technologies like OmniAudio will be crucial in delivering truly believable and engaging audio environments.
References:
- Information provided by AI工具集, "OmniAudio – a spatial audio generation model from Alibaba Tongyi" (OmniAudio – 阿里通义推出的空间音频生成模型)