A new benchmark called the AI Productivity Index (APEX) has been developed to measure the ability of artificial intelligence models to perform complex tasks typically handled by lawyers, doctors, and financial analysts. Created by the data company Mercor, the test uses 200 assignments designed to simulate high-value professional work.
The evaluation, which cost over $500,000 to develop, aims to provide a clear measure of AI's economic utility in white-collar professions. Early results show rapid improvement in AI capabilities, though models still fall short of consistent expert-level human performance.
Key Takeaways
- Mercor launched the AI Productivity Index (APEX) to test AI on professional tasks in law, medicine, finance, and consulting.
- The benchmark consists of 200 tasks created by highly paid experts from top firms such as McKinsey and Goldman Sachs.
- OpenAI's latest models show significant progress, with GPT-5 scoring 64.2%, up from GPT-4o's 35.9%.
- Despite these improvements, AI models achieved perfect scores on only 2 of the 200 tasks, highlighting current limitations.
- The development signals a shift toward evaluating AI on real-world economic value rather than abstract puzzles.
A New Standard for AI Evaluation
The AI Productivity Index, or APEX, represents a significant effort to quantify how well AI can perform economically valuable work. Mercor, a company that provides expert data for AI development, designed the benchmark to move beyond abstract tests and focus on practical, real-world applications.
"How do we get very comprehensive about what it means to be a consultant or a banker or a doctor or lawyer?" said Brendan Foody, the 22-year-old CEO of Mercor. The company's goal was to create a test that deeply probes AI's capabilities in specialized fields.
To achieve this, Mercor invested heavily in human expertise. The company spent more than $500,000 to develop the 200 tasks that make up the APEX benchmark. These tasks were created by professionals with extensive experience in their respective industries.
Harnessing Elite Human Expertise
The creation of the APEX benchmark relied on a team of highly qualified professionals. Mercor contracted experts whose previous employers include top-tier investment banks like Goldman Sachs and JPMorgan, consulting firms such as McKinsey and Boston Consulting Group, and major law firms.
Expertise by the Numbers
The professionals hired by Mercor to design the APEX tasks have an average of 7.25 years of professional experience. The company advertises hourly rates reaching over $200 for senior experts, equivalent to an annual salary of about $400,000.
This approach highlights a growing trend in the AI industry: the need for specialized, high-level human knowledge to both train and evaluate advanced AI systems. The era of relying solely on low-paid crowdworkers for data labeling is evolving, especially for tasks requiring deep domain expertise.
The project also involved high-profile advisors, including a former global managing director of McKinsey and a former dean of Harvard Business School, who helped shape the scope and design of the tasks.
Measuring AI Performance on Professional Tasks
The results from the APEX benchmark show that while AI is improving rapidly, it has not yet reached the level of a human professional. OpenAI's GPT-4o, released in May 2024, achieved a score of 35.9% on the index.
Its successor, GPT-5, tested just over a year later, scored 64.2%, demonstrating a significant leap in performance. This was the highest score achieved on the benchmark.
"Getting 100% would mean that you'd basically have an analyst or an associate in a box that you could go and send tasks to," explained Osvald Nitski, one of the paper's authors.
However, the study's authors caution against misinterpreting these scores. A 64.2% score does not mean an AI can deliver 64.2% of the value of a human employee. In many professional contexts, work that is not 100% correct may be effectively useless. The GPT-5 model achieved a perfect score on only two of the 200 tasks.
Current Limitations and Future Directions
The APEX benchmark, while comprehensive, has several limitations that its creators acknowledge. The current version focuses on well-defined tasks, such as diagnosing a patient from provided evidence or building a financial model based on specific assumptions. It does not test more open-ended strategic thinking that often defines high-level professional work.
Furthermore, the evaluation is entirely text-based. It does not assess an AI's ability to use software tools, spreadsheets, or other digital instruments that are essential for a human knowledge worker. Mercor has stated that future versions of APEX will aim to address these limitations.
The Evolution of AI Benchmarking
Early AI tests focused on abstract reasoning puzzles and standardized exams. As models like ChatGPT emerged, benchmarks evolved to include more complex questions created by Ph.D. students. The APEX index represents the next step, directly measuring performance on tasks that mirror the daily work of highly paid professionals.
Another challenge is the complexity of grading. Unlike software engineering, where code can be tested automatically, evaluating a legal memo or a financial valuation is subjective. To overcome this, Mercor uses AI models to assist in grading the outputs; these automated graders agreed with human graders 89% of the time, allowing evaluation at scale.
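That 89% figure is, in essence, the fraction of grading decisions on which the automated grader and a human expert reach the same verdict. The short Python sketch below illustrates how such an agreement rate can be computed; the verdicts shown are hypothetical, and the snippet is an illustration rather than Mercor's actual grading pipeline.

```python
# Minimal sketch (not Mercor's pipeline): how often an AI grader's pass/fail
# judgments on rubric criteria match a human grader's. Verdicts are hypothetical.

def agreement_rate(ai_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of criteria on which the AI and human graders agree."""
    if len(ai_verdicts) != len(human_verdicts):
        raise ValueError("Graders must score the same set of criteria")
    matches = sum(a == h for a, h in zip(ai_verdicts, human_verdicts))
    return matches / len(ai_verdicts)

# Example: the two graders agree on 9 of 10 criteria -> 90% agreement,
# comparable in spirit to the 89% figure reported for APEX.
ai_grades    = [True, True, False, True, True, True, False, True, True, True]
human_grades = [True, True, False, True, True, True, True,  True, True, True]
print(f"Agreement: {agreement_rate(ai_grades, human_grades):.0%}")  # Agreement: 90%
```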
The Broader Impact on the Job Market
The development of benchmarks like APEX reflects a clear trajectory in AI development. As models become more capable, they are increasingly being tested on tasks with direct economic relevance. This follows a pattern seen in other fields, such as software engineering, where AI capabilities have advanced rapidly after robust benchmarks were established.
The principle that "what gets measured gets done" is highly relevant in the field of AI. By creating a clear target for performance in professional services, the APEX benchmark could accelerate progress in these specific areas.
This progress is also reflected in other studies. An OpenAI benchmark published on September 25 found that expert human evaluators preferred an AI's work over a human's 47.6% of the time across 220 different tasks.
As AI continues to advance, its role in the professional world is becoming a central question. "AI got its Ph.D.," said Foody. "Now it's starting to enter the job market." The APEX index provides one of the clearest views yet of how well it is prepared for the interview.