Beijing – ByteDance’s Doubao large language model (LLM) team has announced UltraMem, a new ultra-sparse model architecture designed to tackle the persistently high memory access costs that traditional Mixture-of-Experts (MoE) models incur during inference. The innovation promises to significantly reduce inference costs and accelerate processing speeds, potentially reshaping how large-scale AI models are deployed.
The announcement comes at a time when the demand for efficient and cost-effective AI solutions is rapidly growing. Traditional MoE architectures, while powerful, often suffer from substantial memory access overhead, hindering their practical application in resource-constrained environments. UltraMem aims to address this bottleneck head-on.
According to the Doubao team, UltraMem achieves its performance gains through a combination of key technological advancements:
- Multi-Layer Structure Improvements: Instead of a single large memory layer, UltraMem distributes several smaller memory layers throughout the Transformer stack. Coupled with skip-layer operations, this lets memory lookups run in parallel with other computation and relieves memory access bottlenecks.
- Optimized Value Retrieval: UltraMem employs Tucker Decomposition Query-Key Retrieval (TDQKR) to improve the precision of value retrieval, ensuring that the most relevant memory slots are selected efficiently.
- Implicit Value Expansion (IVE): Borrowing the idea of virtual and physical memory, IVE expands the sparse parameter space implicitly, reducing the memory footprint and deployment costs of sparse models and allowing more efficient use of available hardware.
The results of initial experiments are compelling. UltraMem demonstrates significant performance advantages across various activation parameter scales. Notably, its scalability surpasses that of traditional MoE architectures as the number of sparse parameters increases, suggesting its potential for handling even larger and more complex models.
Key benefits of UltraMem include:
- Reduced Inference Costs: By optimizing memory access, UltraMem can cut inference costs by up to 83%, making large language models cheaper and more practical to deploy.
- Increased Inference Speed: Compared with traditional MoE architectures, UltraMem delivers inference speeds 2-6 times faster. The gain is most pronounced at common batch sizes, where UltraMem's memory access cost comes close to that of a dense model with equivalent compute.
The development of UltraMem represents a significant step forward in the pursuit of efficient and scalable AI. By addressing the memory access bottleneck, ByteDance’s Doubao team is paving the way for the broader adoption of large language models in a variety of applications. Further research and development will be crucial to fully realize the potential of this innovative architecture.
References:
- Information retrieved from: AI工具集