Our mission is to ensure that artificial general intelligence benefits all of humanity. As part of our mission, we want to transparently communicate progress on how AI models can help people in the real world. That’s why we’re introducing GDPval: a new evaluation designed to help us track how well our models and others perform on economically valuable, real-world tasks. We call this evaluation GDPval because we started with the concept of Gross Domestic Product (GDP) as a key economic indicator and drew tasks from the key occupations in the industries that contribute most to GDP.
People often speculate about AI’s broader impact on society, but the clearest way to understand its potential is by looking at what models are already capable of doing. History shows that major technologies—from the internet to smartphones—took more than a decade to go from invention to widespread adoption. Evaluations like GDPval help ground conversations about future AI improvements in evidence rather than guesswork, and can help us track model improvement over time.
Previous AI evaluations, like challenging academic tests and competitive coding challenges, have been essential in pushing the boundaries of model reasoning capabilities, but they often fall short of capturing the kinds of tasks that many people handle in their everyday work.
To bridge this gap, we’ve been developing evaluations that measure increasingly realistic and economically relevant capabilities. This progression has moved from classic academic benchmarks like MMLU (exam-style questions across dozens of subjects), to more applied evaluations like SWE-Bench (software engineering bug-fixing tasks), MLE-Bench (machine learning engineering tasks such as model training and analysis), and Paper-Bench (scientific reasoning and critique on research papers), and more recently to market-based evaluations like SWE-Lancer (freelance software engineering projects based on real payouts).
GDPval is the next step in that progression. It measures model performance on tasks drawn directly from the real-world knowledge work of experienced professionals across a wide range of occupations and sectors, providing a clearer picture of how models perform on economically valuable tasks. Evaluating models on realistic occupational tasks helps us understand not just how well they perform in the lab, but how they might support people in the work they do every day.
What GDPval measures
GDPval, the first version of this evaluation, spans 44 occupations selected from the top 9 industries contributing to U.S. GDP. The full GDPval set includes 1,320 specialized tasks (220 in the gold open-sourced set), each meticulously crafted and vetted by professionals from these fields with an average of more than 14 years of experience. Every task is based on real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan.
GDPval is distinctive in both the realism and the diversity of the tasks it evaluates. Unlike other evaluations tied to economic value that concentrate on specific domains (e.g., SWE-Lancer), GDPval covers many tasks and occupations. And unlike benchmarks that synthetically create tasks in the style of an academic exam (e.g., Humanity's Last Exam or MMLU), GDPval focuses on deliverables that either are actual pieces of work that exist today or are similarly constructed work products.
Unlike traditional benchmarks, GDPval tasks are not simple text prompts. They come with reference files and context, and the expected deliverables span documents, slides, diagrams, spreadsheets, and multimedia. This grounding makes GDPval a more realistic test of how models might support professionals.
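To make this concrete, here is a minimal sketch in Python of what one such task might look like as a data record. The class, field names, and example values are illustrative assumptions for exposition, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a GDPval-style task record. Field names and
# types are assumptions for illustration, not the dataset's schema.

@dataclass
class GDPvalTask:
    occupation: str                       # e.g., "Lawyers" or "Registered Nurses"
    industry: str                         # one of the 9 covered industries
    prompt: str                           # the task instructions
    reference_files: list[str] = field(default_factory=list)  # supporting context
    deliverable_format: str = "document"  # document, slides, diagram, spreadsheet, multimedia

# Example: a legal-drafting task with supporting reference files.
task = GDPvalTask(
    occupation="Lawyers",
    industry="Professional, Scientific, and Technical Services",
    prompt="Draft a legal brief based on the attached case materials.",
    reference_files=["case_summary.pdf", "prior_rulings.docx"],
    deliverable_format="document",
)
```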
GDPval is an early step that doesn’t reflect the full nuance of many economic tasks. While it spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts. Future versions will extend to more interactive workflows and context-rich tasks to better reflect the complexity of real-world knowledge work (see more in our Limitations section below).
How we chose occupations
GDPval covers tasks across 9 industries and 44 occupations, and future versions will continue to expand coverage. The initial 9 industries were chosen based on those contributing over 5% to U.S. GDP, as determined by data from the Federal Reserve Bank of St. Louis.

Then, we selected the 5 occupations within each industry that contribute most to total wages and compensation and are predominantly knowledge work occupations, using wage and employment data from the May 2024 U.S. Bureau of Labor Statistics (BLS) occupational employment report.

To determine whether the occupations were predominantly knowledge work, we used task data from O*NET, a database of U.S. occupational information sponsored by the U.S. Department of Labor. We classified whether each task for each occupation in O*NET was knowledge work or physical work/manual labor (requiring actions to be taken in the physical world). An occupation qualified overall as “predominantly knowledge work” if at least 60% of its component tasks were classified as not involving physical work or manual labor. We chose this 60% threshold as a starting point for the first version of GDPval, focusing on occupations where AI could have the highest impact on real-world productivity.
This process yielded 44 occupations for inclusion.
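To illustrate the selection procedure, here is a minimal Python sketch of the filtering logic. The data structures, field names, and function signature are assumptions made for exposition; the underlying FRED, BLS, and O*NET datasets are not schematized here.

```python
from dataclasses import dataclass, field

# A minimal sketch of the occupation-selection logic described above.
# All structures and field names are illustrative assumptions.

@dataclass
class Occupation:
    name: str
    industry: str
    total_compensation: float  # total wages and compensation (BLS, May 2024)
    task_is_knowledge_work: list[bool] = field(default_factory=list)  # per-O*NET-task labels

    def knowledge_work_share(self) -> float:
        """Fraction of this occupation's O*NET tasks labeled as knowledge work."""
        if not self.task_is_knowledge_work:
            return 0.0
        return sum(self.task_is_knowledge_work) / len(self.task_is_knowledge_work)

def select_occupations(
    industry_gdp_share: dict[str, float],  # industry -> share of U.S. GDP
    occupations: list[Occupation],
    gdp_cutoff: float = 0.05,              # industries contributing over 5% to GDP
    knowledge_cutoff: float = 0.60,        # at least 60% of tasks are knowledge work
    per_industry: int = 5,                 # top 5 occupations per industry
) -> list[Occupation]:
    # Step 1: keep industries contributing more than 5% to U.S. GDP.
    industries = {i for i, share in industry_gdp_share.items() if share > gdp_cutoff}

    selected: list[Occupation] = []
    for industry in sorted(industries):
        # Step 2: restrict to predominantly knowledge-work occupations.
        candidates = [
            occ for occ in occupations
            if occ.industry == industry
            and occ.knowledge_work_share() >= knowledge_cutoff
        ]
        # Step 3: rank by total wages and compensation, keep the top 5.
        candidates.sort(key=lambda occ: occ.total_compensation, reverse=True)
        selected.extend(candidates[:per_industry])
    return selected
```

Note that the sketch applies the knowledge-work filter before ranking by compensation, mirroring the order described above: an occupation must first qualify as predominantly knowledge work before its wage contribution is considered.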
