Beijing, [Date of Publication] – Sequoia China has officially open-sourced its xbench evaluation suite, a move poised to significantly impact the landscape of Artificial Intelligence (AI) benchmarking. The suite comprises two distinct subsets: xbench-ScienceQA, focusing on challenging scientific and engineering question answering, and xbench-DeepSearch, designed to assess the deep search capabilities of AI Agents.

The announcement, initially made three weeks ago by Sequoia China, reflects a commitment to fostering collaboration and transparency within the AI community. The firm aims to transform its internal evaluation tools into a publicly accessible AI benchmark, encouraging contributions from AI researchers and developers globally.

"We believe the open-source spirit will allow xbench to evolve more effectively, creating greater value for the AI community," stated a representative from Sequoia China. The company plans to continuously update the evaluation suite to keep pace with the evolving landscape of large language models and AI Agents. A black-box and white-box mechanism will be employed to serve a broad range of developers while mitigating the overfitting issues common to static evaluation sets, ensuring xbench's long-term validity.

The open-source repositories are available at:

xbench-ScienceQA: Pushing the Boundaries of Scientific Reasoning

xbench-ScienceQA addresses the growing need for more rigorous evaluation metrics in the face of rapidly advancing reasoning models. Traditional academic benchmarks like MMLU and MATH are increasingly reaching near-perfect scores, rendering them less effective in differentiating true advancements in model capabilities.

To overcome this limitation, xbench-ScienceQA focuses on high-difficulty science and engineering questions. Crafted by doctoral students and industry experts, the questions are designed to be highly discerning, with an average accuracy rate of just 32%. The benchmark assesses an AI system's ability to tackle complex scientific problems that demand doctoral-level knowledge and reasoning. Newer datasets such as GPQA, SuperGPQA, and HLE are emerging as evaluation standards, but they contain few graduate-level questions, the questions are difficult to produce, the answers are difficult to verify, and the sets are rarely updated after release, making it hard to gauge how contaminated an evaluation set has become. To address this, Sequoia China has invited doctoral students from top universities and senior industry experts to collect and curate questions from reliable sources.
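The headline metric here is simple accuracy over the question set. As a rough illustration, the scoring of such a benchmark can be sketched as below; the question/answer fields and the exact-match grading rule are illustrative assumptions, not the actual xbench-ScienceQA format.

```python
# Hypothetical sketch: scoring a model's answers against a QA evaluation set.
# The data layout and exact-match grading are assumptions for illustration.

def score(eval_set, model_answers):
    """Return the fraction of questions answered correctly (exact match,
    case- and whitespace-insensitive)."""
    correct = sum(
        1
        for item in eval_set
        if model_answers.get(item["id"], "").strip().lower()
        == item["answer"].strip().lower()
    )
    return correct / len(eval_set)

# Toy evaluation set and a set of model answers.
eval_set = [
    {"id": "q1", "answer": "4.2 eV"},
    {"id": "q2", "answer": "SN2"},
    {"id": "q3", "answer": "O(n log n)"},
]
answers = {"q1": "4.2 eV", "q2": "SN1", "q3": "o(n log n)"}

print(score(eval_set, answers))  # 2 of 3 exact matches
```

An average accuracy of 32% on questions like these means a strong model is expected to miss roughly two out of every three, which is what keeps the benchmark discriminative.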

xbench-DeepSearch: Evaluating AI Agents' Deep Search Capabilities in the Chinese Internet Environment

xbench-DeepSearch is specifically designed to evaluate the deep search capabilities of AI Agents, focusing on planning, searching, reasoning, and summarization skills. Crucially, it is tailored to the Chinese internet environment, reflecting the unique characteristics and challenges of information retrieval within that context. This is particularly important as AI Agents increasingly play a role in navigating and synthesizing information from diverse online sources.
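The four skills named above form an iterative loop. A minimal sketch of such a plan → search → reason → summarize cycle might look like the following; the helper functions and the naive follow-up strategy are stand-ins, not part of xbench-DeepSearch itself.

```python
# Hypothetical sketch of a deep-search agent loop: plan a query, search,
# accumulate evidence, and summarize. All helpers are illustrative stubs.

def deep_search(question, search_fn, max_rounds=3):
    """Iteratively query a search backend and summarize the evidence."""
    evidence = []
    query = question                # planning: start from the question itself
    for _ in range(max_rounds):
        results = search_fn(query)  # searching
        if not results:
            break                   # no new leads to follow
        evidence.extend(results)    # reasoning: accumulate evidence
        query = results[-1]         # naive follow-up: search the last finding
    return " / ".join(evidence)    # summarization

def mock_search(query):
    """Stand-in for a real search backend."""
    corpus = {
        "capital of France": ["Paris"],
        "Paris": ["Paris is in Europe"],
    }
    return corpus.get(query, [])

print(deep_search("capital of France", mock_search))
```

A benchmark like xbench-DeepSearch would grade the final summary against a reference answer; evaluating it on the Chinese internet additionally stresses language-specific retrieval and source reliability.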

The Future of AI Benchmarking with xbench

The open-sourcing of xbench represents a significant step towards more robust and relevant AI benchmarking. By providing access to challenging and continuously updated evaluation sets, Sequoia China aims to accelerate the development of more capable and reliable AI systems. The black-box and white-box approach to evaluation-set management is particularly noteworthy, as it seeks to balance accessibility against the imperative to prevent overfitting and maintain the integrity of the benchmark over time.
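One common way to realize such a mechanism is to publish part of the evaluation set for development (the white box) while holding out the rest for official scoring (the black box); a large gap between the two scores signals overfitting. The sketch below illustrates that split under assumed parameters; the ratio, seed, and item format are not from the xbench release.

```python
# Hypothetical sketch: splitting an evaluation set into a published
# white-box portion and a held-out black-box portion. The 50/50 ratio
# and fixed seed are illustrative assumptions.
import random

def split_eval_set(items, white_fraction=0.5, seed=0):
    """Return (white_box, black_box) partitions of the item list."""
    rng = random.Random(seed)       # fixed seed -> reproducible split
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * white_fraction)
    return shuffled[:cut], shuffled[cut:]

items = [f"q{i}" for i in range(10)]
white, black = split_eval_set(items)
print(len(white), len(black))  # 5 5
```

Periodically rotating fresh questions into the black-box portion is what lets a benchmark stay valid even after its public half has circulated in training data.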

The xbench initiative underscores the growing importance of rigorous evaluation in the AI field, and its open-source nature promises to foster collaboration and innovation within the global AI community. As AI models continue to evolve, benchmarks like xbench will be crucial in guiding development and ensuring that AI systems are truly capable of solving complex, real-world problems.
