SHANGHAI – ZHIYUAN Robotics has officially launched Genie Operator-1 (GO-1), its first general-purpose embodied foundation model. The model is built on the Vision-Language-Latent-Action (ViLLA) architecture, which combines a multimodal vision-language model (VLM) with a Mixture of Experts (MoE) system. By predicting latent action tokens, GO-1 bridges the gap between image-text inputs and robotic action execution.
ZHIYUAN positions the ViLLA architecture, the core of GO-1, as an evolution of earlier Vision-Language-Action (VLA) models.
ViLLA: Bridging the Gap Between Perception and Action
The ViLLA architecture comprises two key components: a VLM and an MoE. The VLM draws on large amounts of internet image and text data to achieve general scene perception and language understanding. The MoE, in turn, consists of a Latent Planner and an Action Expert.
- Latent Planner: This component uses extensive cross-embodiment and human operation video data to learn a general understanding of actions.
- Action Expert: Trained on millions of real-world robotic data points, the Action Expert can execute precise and nuanced actions.
This interconnected system allows GO-1 to learn from human video demonstrations and generalize to new tasks from a limited number of examples, which lowers the barrier to entry for embodied intelligence applications. ZHIYUAN says it has deployed GO-1 across its range of robotic platforms and continues to improve the model's capabilities.
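The division of labor described above (VLM, then Latent Planner, then Action Expert) can be illustrated with a toy data-flow sketch. This is not ZHIYUAN's implementation: in GO-1 each stage is a learned neural network, while here each is a trivial stand-in function. Every name, constant, and formula below (`Observation`, `vlm_encode`, `latent_planner`, `action_expert`, `VOCAB_SIZE`, `NUM_TOKENS`, `DOF`) is a hypothetical placeholder chosen only to show how image-text input could pass through latent action tokens before becoming robot commands.

```python
from dataclasses import dataclass
from typing import List

# All names and sizes below are hypothetical placeholders, not GO-1's real API.
VOCAB_SIZE = 32      # assumed number of distinct latent action tokens
NUM_TOKENS = 4       # assumed number of latent tokens predicted per step
DOF = 7              # assumed number of joints on the robot arm


@dataclass
class Observation:
    image: List[float]   # stand-in for camera features
    instruction: str     # natural-language task description


def vlm_encode(obs: Observation) -> List[float]:
    """Toy VLM stage: fuse image features and a text feature into one vector."""
    text_feature = float(len(obs.instruction.split()))
    return obs.image + [text_feature]


def latent_planner(features: List[float]) -> List[int]:
    """Toy Latent Planner: map fused features to discrete latent action tokens."""
    total = abs(sum(features))
    return [int(total * (i + 1)) % VOCAB_SIZE for i in range(NUM_TOKENS)]


def action_expert(latent_tokens: List[int]) -> List[List[float]]:
    """Toy Action Expert: decode each latent token into one joint-space command."""
    return [[token / VOCAB_SIZE] * DOF for token in latent_tokens]


def run_policy(obs: Observation) -> List[List[float]]:
    """Chain the three stages: perception, latent planning, action decoding."""
    features = vlm_encode(obs)
    tokens = latent_planner(features)
    return action_expert(tokens)


obs = Observation(image=[0.1, 0.2, 0.3], instruction="pick up the cup")
actions = run_policy(obs)
print(len(actions), len(actions[0]))
```

The point of the sketch is the interface between stages: the planner emits a short sequence of discrete tokens, and the action stage only has to decode those tokens into low-level commands. As a simplification, the toy Action Expert sees only the latent tokens; a real system would likely also condition on the perception features.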
AgiBot World: Fueling the Development of GO-1
The development of GO-1 drew on AgiBot World, a dataset ZHIYUAN created in late 2024. It contains over one million trajectories covering 217 tasks across five distinct scenarios. This high-quality, real-world data provided the foundation for training and refining GO-1’s capabilities.
Exceeding State-of-the-Art Performance
According to ZHIYUAN, GO-1’s ViLLA architecture allows it to surpass existing open-source state-of-the-art models in real-world dexterous manipulation and long-duration tasks. The company credits the prediction of latent action tokens, which helps translate image and text instructions into concrete robotic actions.
The Future of Embodied AI
The launch of GO-1 is a milestone in the development of embodied AI. By combining perception, planning, and execution in one model, GO-1 points toward robots that can interact with and learn from the real world. Together, the ViLLA architecture and the AgiBot World dataset give ZHIYUAN a foundation for future work on increasingly complex and nuanced tasks.
References
ZHIYUAN Robotics. (2024). AgiBot World & Genie Operator-1 (GO-1). Retrieved from https://agibot-world.com/blog/agibot_go1.pdf
