In the rapidly evolving landscape of generative AI, diffusion models have solidified their position as the dominant architecture for image generation. Traditional diffusion models, however, often fall short in computational efficiency because they treat all noise levels and conditional inputs uniformly throughout the diffusion process. Addressing this limitation, a collaboration between Tsinghua University and the Kuaishou Keyframe team has produced DiffMoE (Dynamic Token Selection for Scalable Diffusion Transformers), a novel approach that pairs a dynamic token selection mechanism with a global token pool design to push the boundaries of efficiency and performance in diffusion models.
This article delves into the intricacies of DiffMoE, exploring its core mechanisms, performance advantages, and potential implications for the future of generative AI. We will examine the research team’s motivations, the technical challenges they overcame, and the potential applications of this innovative technology.
The Rise of Diffusion Models and Their Inherent Limitations
Diffusion models, inspired by non-equilibrium thermodynamics, operate by gradually adding noise to an image until it becomes pure noise, and then learning to reverse this process to generate images from noise. This approach has proven remarkably effective in producing high-quality, photorealistic images, surpassing the performance of earlier generative models like GANs (Generative Adversarial Networks) in many aspects.
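For context, the forward (noising) process of a standard DDPM-style diffusion model has a simple closed form; this is textbook background rather than anything specific to DiffMoE:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

Here β_s is the variance schedule at step s: the clean image x_0 can be noised to any step t in one shot, and the network is trained to predict the added noise so that the process can be reversed step by step.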
However, the computational demands of diffusion models are significant. The iterative nature of the diffusion and denoising processes, coupled with the need to process large image datasets, can make training and inference computationally expensive. A major contributor to this inefficiency is the uniform treatment of all tokens (image patches or features) throughout the diffusion process. In reality, different tokens carry varying levels of importance at different stages of denoising. For instance, tokens corresponding to fine details might be more crucial in the later stages of denoising, while tokens representing global structure might be more important in the early stages. Treating all tokens equally leads to redundant computations and wasted resources.
DiffMoE: Addressing the Inefficiency Bottleneck
The DiffMoE framework directly tackles this inefficiency by introducing a dynamic token selection mechanism. This mechanism intelligently selects the most relevant tokens for processing at each denoising step, effectively reducing the computational burden without sacrificing image quality.
Key Components of DiffMoE:
- Dynamic Token Selection: The core innovation of DiffMoE is its ability to dynamically select tokens based on their relevance to the current denoising step. The selection is guided by a learnable gating network that analyzes the input tokens and their corresponding noise levels, assigning each token an importance score; only tokens whose scores exceed a threshold are processed further. This selective processing significantly reduces computational cost, especially in the early stages of denoising, where much of the image is still noise. (A minimal code sketch of this mechanism follows the list.)
- Batch-Level Global Token Pool: To maintain global context and ensure coherence in the generated images, DiffMoE incorporates a batch-level global token pool. This pool aggregates information from all tokens within a batch, allowing the model to capture long-range dependencies between different parts of the image. It acts as a shared memory that facilitates information exchange between selected tokens, preventing the model from focusing solely on local details and losing sight of the overall image structure.
- Scalable Diffusion Transformers: DiffMoE is built on diffusion transformers, transformer-based architectures designed specifically for diffusion models. Transformers are well suited to capturing long-range dependencies in images, making them a natural fit for diffusion modeling. By integrating dynamic token selection and the global token pool into a diffusion transformer, DiffMoE achieves both efficiency and high-quality image generation.
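To make the selection mechanism concrete, below is a minimal PyTorch sketch of a threshold-based token gate. All names here (TokenGate, select_tokens, the 0.5 threshold) are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Illustrative gating network: scores each token's relevance from its
    embedding and the current noise-level (timestep) embedding."""

    def __init__(self, dim: int, time_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim + time_dim, dim),
            nn.SiLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, tokens: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); t_emb: (B, time_dim)
        t = t_emb[:, None, :].expand(-1, tokens.shape[1], -1)  # broadcast over tokens
        scores = self.mlp(torch.cat([tokens, t], dim=-1)).squeeze(-1)  # (B, N)
        return torch.sigmoid(scores)  # per-token relevance in [0, 1]

def select_tokens(tokens: torch.Tensor, scores: torch.Tensor, threshold: float = 0.5):
    """Keep only tokens whose gate score exceeds the threshold."""
    mask = scores > threshold        # (B, N) boolean selection mask
    return tokens[mask], mask        # selected tokens, flattened across the batch

# Example:
gate = TokenGate(dim=256, time_dim=64)
tokens, t_emb = torch.randn(2, 64, 256), torch.randn(2, 64)
kept, mask = select_tokens(tokens, gate(tokens, t_emb))
```

Because the boolean mask is applied across the whole batch, the number of surviving tokens can differ per image and per noise level, which is exactly the flexibility that lets compute concentrate where denoising is hardest.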
Technical Deep Dive: How DiffMoE Works
Let’s delve into the technical details of how DiffMoE operates; a consolidated code sketch of these steps follows the list:
1. Input Encoding: The input image is first encoded into a sequence of tokens, typically by dividing the image into patches and embedding each patch as a feature vector. These tokens, along with their corresponding noise levels, are fed into the DiffMoE model.
2. Gating Network: The gating network analyzes each token and its noise level to determine its importance. It typically consists of a multi-layer perceptron (MLP) that takes the token embedding and noise level as input and outputs a relevance score.
3. Token Selection: Based on the scores from the gating network, a subset of tokens is selected for further processing, typically via thresholding: only tokens whose scores exceed the threshold are kept. The threshold can be fixed or learned adaptively.
4. Global Token Pool Aggregation: The selected tokens are aggregated into a global token pool, which can be implemented with attention mechanisms or pooling operations. The pool allows the model to capture long-range dependencies and relationships between different parts of the image.
5. Denoising Process: The selected tokens, together with the global token pool, are fed into the denoising network, typically a transformer-based architecture that predicts the noise to remove from the tokens.
6. Iterative Refinement: The denoising process repeats iteratively, gradually removing noise until a high-quality image emerges. At each iteration, the dynamic token selection mechanism re-selects the most relevant tokens, keeping computational resources focused on the most important parts of the image.
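Putting these steps together, here is a simplified sketch of a single denoising iteration, gate then select then attend to a batch-level pool then denoise, under the same illustrative assumptions as before (DiffMoEStepSketch and its internals are invented names for exposition, not the authors' code):

```python
import torch
import torch.nn as nn

class DiffMoEStepSketch(nn.Module):
    """One simplified denoising iteration: gate -> select -> global pool -> denoise.
    Structure and names are illustrative, not the official DiffMoE implementation."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, 1)                       # stand-in gating network
        self.pool_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.denoiser = nn.Sequential(                      # stand-in for a DiT block
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, tokens: torch.Tensor, threshold: float = 0.5):
        # tokens: (B, N, dim) noisy patch embeddings
        scores = torch.sigmoid(self.gate(tokens)).squeeze(-1)   # (B, N) relevance
        mask = scores > threshold

        # Batch-level global token pool: selected tokens from the whole batch
        # form a shared context that every token can attend to.
        pool = tokens[mask].unsqueeze(0).expand(tokens.shape[0], -1, -1)  # (B, K, dim)
        ctx, _ = self.pool_attn(tokens, pool, pool)

        # Update only the selected tokens; the rest pass through unchanged.
        update = self.denoiser(tokens + ctx)
        return torch.where(mask.unsqueeze(-1), update, tokens), mask

# Example: loop this step over timesteps for iterative refinement.
step = DiffMoEStepSketch()
x = torch.randn(2, 64, 256)              # batch of 2 images, 64 patch tokens each
for _ in range(4):                       # real samplers run many more steps
    x, kept = step(x)
```

In DiffMoE proper, the routing and pooling are integrated into mixture-of-experts layers inside the transformer; this sketch only conveys the overall data flow.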
Experimental Results and Performance Evaluation
The researchers conducted extensive experiments to evaluate DiffMoE, comparing it against state-of-the-art diffusion models on image generation benchmarks, most prominently class-conditional ImageNet. The results show that DiffMoE improves both efficiency and image quality; notably, the paper reports that a sparse DiffMoE model can rival dense architectures with roughly three times as many activated parameters, while keeping its own activated parameter count at 1× during inference.
- Improved Efficiency: DiffMoE significantly reduces the computational cost of diffusion modeling, especially in the early stages of denoising. This efficiency gain allows faster training and inference, making diffusion models more accessible to researchers and practitioners with limited computational resources.
- Enhanced Image Quality: Despite the reduced computational cost, DiffMoE maintains or even improves the quality of generated images. The dynamic token selection mechanism ensures that the most relevant tokens are processed, preserving important details and maintaining global coherence.
- Scalability: Dynamic token selection makes DiffMoE more scalable to larger image sizes and more complex datasets. By selectively processing only the most relevant tokens, DiffMoE can handle larger inputs without exceeding computational limits.
Implications and Future Directions
The development of DiffMoE represents a significant step forward in the field of generative AI. By addressing the inefficiency bottleneck of traditional diffusion models, DiffMoE opens up new possibilities for high-quality image generation and other generative tasks.
- Faster Training and Inference: The improved efficiency of DiffMoE allows faster training and inference, making diffusion models more practical for real-world applications.
- Resource-Constrained Environments: DiffMoE’s efficiency makes it suitable for deployment in resource-constrained environments, such as mobile devices or edge computing platforms.
- Larger and More Complex Datasets: The scalability of DiffMoE enables training diffusion models on larger and more complex datasets, leading to more realistic and diverse image generation.
- New Generative Applications: The advances in diffusion modeling enabled by DiffMoE can lead to new applications across fields including image editing, video generation, 3D modeling, and drug discovery.
Future research directions include:
- Adaptive Thresholding: Developing more sophisticated methods for adaptively adjusting the token selection threshold based on image content and noise level (a toy sketch follows the list).
- Contextual Token Selection: Incorporating contextual information into the token selection process to further improve the relevance of selected tokens.
- Multi-Modal Diffusion Models: Extending DiffMoE to multi-modal diffusion models that generate images from text descriptions or other modalities.
- Applications Beyond Image Generation: Exploring DiffMoE’s potential in other generative tasks, such as audio generation and natural language generation.
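As a toy illustration of the adaptive-thresholding direction, one could make the selection threshold a learnable, noise-level-dependent quantity rather than a fixed constant. Everything below is a hypothetical sketch, not something the paper proposes concretely:

```python
import torch
import torch.nn as nn

class AdaptiveThreshold(nn.Module):
    """Hypothetical sketch: predict a selection threshold from the
    timestep embedding instead of using a fixed constant."""

    def __init__(self, time_dim: int):
        super().__init__()
        self.proj = nn.Linear(time_dim, 1)

    def forward(self, scores: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # scores: (B, N) gate scores; t_emb: (B, time_dim)
        tau = torch.sigmoid(self.proj(t_emb))      # (B, 1), one threshold per sample
        return scores > tau                        # (B, N) boolean selection mask
```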
The Kuaishou & Tsinghua University Collaboration: A Model for Innovation
The success of DiffMoE is a testament to the power of collaboration between academia and industry. The combination of Tsinghua University’s cutting-edge research expertise and Kuaishou’s practical engineering experience has resulted in a truly innovative solution that addresses a critical challenge in the field of generative AI. This collaboration serves as a model for future partnerships between academic institutions and industry leaders, fostering innovation and accelerating the development of new technologies.
Conclusion
DiffMoE represents a significant advance in diffusion modeling, offering a compelling answer to the efficiency challenges of traditional approaches. By intelligently selecting the most relevant tokens for processing, it improves computational efficiency without sacrificing image quality, paving the way for faster training, more scalable models, and a wider range of applications for diffusion models. The collaboration between Tsinghua University and the Kuaishou Keyframe team underscores how partnerships between academia and industry drive innovation in this rapidly evolving field, and as research builds on the foundation DiffMoE has laid, we can expect further impressive advances in generative AI in the years to come.
References
- Shi, M., et al. (2025). DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers. arXiv preprint arXiv:2503.14487.
- Project Homepage: https://shiml20.github.io/DiffMoE/
- Code Repository: https://github.com/KwaiVGI/DiffMoE
