Chinese Research Team Develops LLaMA-Omni: A Low-Latency, Open-Source Speech Interaction Model
Researchers from the Institute of Computing Technology at the Chinese Academy of Sciences and the University of Chinese Academy of Sciences have developed LLaMA-Omni, an open-source model that can be trained on just four GPUs in under three days. The model represents a significant advance in speech interaction with large language models (LLMs).
LLaMA-Omni receives spoken instructions and simultaneously generates text and speech responses, with a response latency as low as 226 ms, lower than the 320 ms average audio response latency reported for GPT-4o. This is particularly notable because most LLMs support only text-based interaction, which restricts their use in scenarios where text input and output are impractical.
The emergence of GPT-4o has shown that voice interaction with LLMs is possible, but the open-source community has been slow to build comparable models. The simplest route is a cascade that chains an automatic speech recognition (ASR) model, the LLM itself, and a text-to-speech (TTS) model. Because the transcript, the text response, and the speech response are produced one after another, the stages' delays add up, as the toy example below illustrates.
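The following sketch only mimics the timing behavior of such a cascade; the function bodies and their delays are made-up placeholders, not a real ASR/LLM/TTS API.

    import time

    def asr(audio):    # speech -> transcript (placeholder delay)
        time.sleep(0.1)
        return "what's the weather like today?"

    def llm(text):     # transcript -> text response (placeholder delay)
        time.sleep(0.1)
        return "It looks sunny today."

    def tts(text):     # text response -> waveform (placeholder delay)
        time.sleep(0.1)
        return b"\x00" * 16000

    start = time.time()
    reply_audio = tts(llm(asr(b"...")))  # three strictly sequential stages
    print(f"cascade latency: {time.time() - start:.2f}s")  # ~0.30s: delays stack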
One end-to-end alternative explored in prior work is to discretize speech into tokens and extend the LLM's vocabulary so the model can read and emit speech units the way it does text; in theory this allows speech responses to be generated directly from spoken instructions, without intermediate text, and thus with very low latency. LLaMA-Omni pursues the same goal of skipping the cascade's sequential stages, but it does so by attaching dedicated speech components to an existing LLM rather than by extending its vocabulary.
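For readers unfamiliar with the vocabulary-extension idea, here is a minimal sketch using GPT-2 as a small stand-in model; the token names and the unit count of 1,000 are illustrative assumptions, and this is not how LLaMA-Omni itself is built.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Register discrete speech units (e.g., clustered acoustic features) as
    # ordinary tokens; the new embedding rows would then be learned during
    # fine-tuning on paired speech-text data.
    unit_tokens = [f"<unit_{i}>" for i in range(1000)]
    tok.add_tokens(unit_tokens)
    model.resize_token_embeddings(len(tok))

    print(len(tok))  # original vocabulary plus 1000 speech-unit tokens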
The model is composed of a speech encoder, a speech adapter, an LLM, and a streaming speech decoder. The speech encoder is Whisper-large-v3, which extracts representations from the spoken instruction; the speech adapter maps those representations into the embedding space of the LLM, here Llama-3.1-8B-Instruct. A minimal sketch of the adapter idea follows.
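The PyTorch sketch below downsamples the encoder's frame sequence by concatenating consecutive frames, then projects into the LLM's hidden size. The hidden sizes (1280 for Whisper-large-v3, 4096 for Llama-3.1-8B) and the downsampling factor are our assumptions about plausible values, not the authors' released code.

    import torch
    import torch.nn as nn

    class SpeechAdapter(nn.Module):
        def __init__(self, enc_dim=1280, llm_dim=4096, k=5):
            super().__init__()
            self.k = k  # concatenate k consecutive frames to shorten the sequence
            self.proj = nn.Sequential(
                nn.Linear(enc_dim * k, llm_dim),
                nn.ReLU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, feats):            # feats: (batch, frames, enc_dim)
            b, t, d = feats.shape
            t = t - t % self.k               # trim so frames divide evenly
            feats = feats[:, :t].reshape(b, t // self.k, d * self.k)
            return self.proj(feats)          # (batch, frames // k, llm_dim)

    x = torch.randn(2, 23, 1280)             # stand-in for encoder output
    print(SpeechAdapter()(x).shape)          # torch.Size([2, 4, 4096])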
To train LLaMA-Omni, the researchers built InstructS2S-200K, a 200K-sample dataset created by rewriting existing text instructions and synthesizing the corresponding speech. Training runs in two stages: the first teaches the model to generate text responses directly from spoken instructions, and the second trains it to generate speech responses, as sketched below.
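The split of trainable versus frozen modules in this sketch follows our reading of the paper (encoder frozen throughout; adapter and LLM trained in stage 1; only the speech decoder trained in stage 2) and uses toy placeholder modules; treat it as an assumption, not the released training code.

    import torch.nn as nn

    # Toy placeholders standing in for the real components.
    speech_encoder = nn.Linear(80, 1280)     # stands in for Whisper-large-v3
    adapter        = nn.Linear(1280, 4096)
    llm            = nn.Linear(4096, 4096)   # stands in for Llama-3.1-8B-Instruct
    speech_decoder = nn.Linear(4096, 1000)   # predicts discrete speech units

    def set_trainable(module, flag):
        for p in module.parameters():
            p.requires_grad = flag

    # Stage 1: learn to answer spoken instructions with text.
    set_trainable(speech_encoder, False)     # Whisper stays frozen
    set_trainable(adapter, True)
    set_trainable(llm, True)
    # ...optimize cross-entropy on the text responses...

    # Stage 2: learn to emit discrete units for the speech responses.
    set_trainable(adapter, False)
    set_trainable(llm, False)
    set_trainable(speech_decoder, True)
    # ...optimize the speech decoder's unit-prediction loss...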
During inference, the LLM generates the text response autoregressively while the speech decoder simultaneously produces the corresponding discrete units, so the speech waveform can be synthesized in a streaming fashion and playback can begin well before the full response is finished.
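Conceptually, the loop interleaves text tokens with audio chunks. Everything in the sketch below (the chunk size, the stand-in decoder and vocoder, the two-units-per-state ratio) is an illustrative assumption meant only to show that interleaving:

    import random

    def stream_response(llm_step, speech_decoder, vocoder, chunk=4):
        hiddens, pending, seen = [], [], 0
        for token, hidden in llm_step():        # autoregressive text decoding
            yield "text", token
            hiddens.append(hidden)
            predicted = speech_decoder(hiddens)  # units for all states so far
            pending.extend(predicted[seen:])     # keep only the new units
            seen = len(predicted)
            while len(pending) >= chunk:         # enough for one audio chunk
                yield "audio", vocoder(pending[:chunk])
                pending = pending[chunk:]
        if pending:                              # flush the remaining tail
            yield "audio", vocoder(pending)

    # Dummy stand-ins so the sketch runs end to end.
    def llm_step():
        for tok in ["It", " looks", " sunny", " today", "."]:
            yield tok, random.random()           # (token, fake hidden state)

    def speech_decoder(hiddens):                 # pretend 2 units per state
        return [int(h * 1000) for h in hiddens for _ in range(2)]

    def vocoder(units):                          # pretend waveform bytes
        return bytes(u % 256 for u in units)

    for kind, payload in stream_response(llm_step, speech_decoder, vocoder):
        print(kind, payload)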
The development of LLaMA-Omni is a significant step forward in the field of speech interaction with LLMs, offering a more efficient and responsive way for users to interact with these powerful models.
References:
– Paper: LLaMA-Omni: Seamless Speech Interaction with Large Language Models
– Code: GitHub – ictnlp/LLaMA-Omni
– Model: Hugging Face – ICTNLP/Llama-3.1-8B-Omni