NWPU Releases OSUM, an Open-Source Speech Understanding Model

Author: 智能小编

February 23, 2025 #nwpu, #opensource, #每日AI快讯

Northwestern Polytechnical University (NWPU) has released OSUM (Open Speech Understanding Model), an open-source speech understanding model developed by the Audio, Speech, and Language Processing Group of its School of Computer Science. The model combines a Whisper encoder with the Qwen2 LLM and is intended to handle a range of speech-related tasks.

What is OSUM?

OSUM is designed to tackle a wide range of speech understanding challenges, including:

Automatic Speech Recognition (ASR): Transcribing spoken language into text.
Speech Emotion Recognition (SER): Identifying the emotional state conveyed in speech.
Speaker Gender Classification (SGC): Determining the gender of the speaker.

The model is trained with an ASR+X multi-task strategy, which combines modality alignment with target-task optimization to keep training efficient and stable. Trained on approximately 50,000 hours of diverse speech data, OSUM is reported to perform well across multiple tasks, particularly in Chinese ASR and multi-task generalization.
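One way to read "ASR+X" is that each training target pairs the transcript (the ASR part) with the label for one additional task X. The article does not give the actual target format, so the sketch below is only an illustration: the `Example` class, the `build_asr_x_target` helper, and the `<TASK:label>` tag syntax are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    transcript: str  # reference transcription (the "ASR" part)
    task: str        # name of the extra task "X", e.g. "SER" or "SGC"
    label: str       # label for that task, e.g. "happy" or "female"


def build_asr_x_target(ex: Example) -> str:
    """Build one training target: transcript first, then the task label.

    The "<TASK:label>" tag format is an assumption made for illustration,
    not the format used by OSUM.
    """
    return f"{ex.transcript} <{ex.task}:{ex.label}>"


if __name__ == "__main__":
    print(build_asr_x_target(Example("hello world", "SER", "happy")))
```

Putting the transcript first means the model always has to align audio with text before it predicts the task label, which is one plausible reading of the "modality alignment" the article mentions.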

Key Features of OSUM:

Speech Recognition: Converts speech to text, supporting multiple languages and dialects.
Speech Recognition with Timestamps: Outputs the start and end times of each word or phrase during speech recognition.
Speech Event Detection: Identifies specific events in speech, such as laughter, coughing, and background noise.
Speech Emotion Recognition: Analyzes the emotional state in speech, such as happiness, sadness, or anger.
Speaking Style Recognition: Identifies the speaker’s style, such as news broadcasting, customer service dialogue, or everyday speech.
Speaker Gender Classification: Determines the speaker’s gender (male or female).
Speaker Age Prediction: Predicts the speaker’s age range (e.g., child, adult, or elderly).
Speech-to-Text Chat: Generates natural-language text responses to spoken input, for use in dialogue systems.
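To make the "recognition with timestamps" feature more concrete, here is a minimal sketch of how timestamped output could be represented and printed. The `Segment` class, the `format_segments` helper, and the bracketed display format are assumptions for illustration. They do not reflect OSUM's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    text: str     # recognized word or phrase
    start: float  # start time in seconds
    end: float    # end time in seconds

    def __post_init__(self) -> None:
        # A segment cannot end before it starts.
        if self.end < self.start:
            raise ValueError("end must not be earlier than start")


def format_segments(segments: List[Segment]) -> str:
    """Render one line per segment as "[start-end] text" (hypothetical format)."""
    return "\n".join(f"[{s.start:.2f}-{s.end:.2f}] {s.text}" for s in segments)


if __name__ == "__main__":
    print(format_segments([Segment("hello", 0.0, 0.5), Segment("world", 0.5, 1.2)]))
```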

Technical Principles:

The core of OSUM’s architecture is its speech encoder, which processes the input audio and extracts features. These features are then passed to downstream components that perform the individual speech understanding tasks.
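The general pattern described here, extracting features once and reusing them for several tasks, can be shown with a toy example. Nothing below is OSUM code: `toy_encoder` just averages absolute amplitudes per frame, and the two "tasks" are trivial placeholder functions chosen only so the sketch runs without any model weights.

```python
from typing import Callable, Dict, List

Features = List[float]


def toy_encoder(samples: List[float], frame_size: int = 4) -> Features:
    """Stand-in for a speech encoder: mean absolute amplitude per frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(abs(s) for s in frame) / len(frame) for frame in frames]


def toy_loudness_task(features: Features) -> str:
    """Placeholder task: label the clip by its loudest frame."""
    return "loud" if max(features, default=0.0) > 0.5 else "quiet"


def toy_activity_task(features: Features) -> int:
    """Placeholder task: count frames whose energy exceeds a small threshold."""
    return sum(1 for f in features if f > 0.1)


def run_pipeline(
    samples: List[float],
    encoder: Callable[[List[float]], Features],
    tasks: Dict[str, Callable[[Features], object]],
) -> Dict[str, object]:
    """Encode the audio once, then hand the same features to every task."""
    features = encoder(samples)
    return {name: task(features) for name, task in tasks.items()}


if __name__ == "__main__":
    audio = [0.0, 0.0, 0.0, 0.0, 0.8, -0.8, 0.8, -0.8]
    tasks = {"loudness": toy_loudness_task, "activity": toy_activity_task}
    print(run_pipeline(audio, toy_encoder, tasks))
```

The point of the structure is that the encoder runs once per input and each additional task only adds the cost of its own downstream step.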

Conclusion:

With OSUM, Northwestern Polytechnical University adds a multi-task model to the set of open-source speech understanding models. Its range of supported tasks and its training on roughly 50,000 hours of speech data make it a useful resource for researchers and developers working on speech recognition and understanding.
