AI research firm Anthropic faced an unusual challenge in its hiring process: its own AI models became so capable that they could outperform top human candidates on the company's engineering test. The result was a complete overhaul of how the company evaluates technical skill, forcing it to abandon realistic problems in favor of more abstract puzzles.
The situation highlights the accelerating pace of AI capabilities and raises fundamental questions about how to measure human expertise when AI can solve complex technical problems in minutes. Anthropic is now on the third version of its test, having rebuilt it twice to stay ahead of its flagship model, Claude.
Key Takeaways
- Anthropic's performance engineering hiring test was repeatedly solved by its own AI model, Claude.
- The latest model, Claude Opus 4.5, matched the best human performance within the two-hour test limit.
- The company had to abandon a realistic, work-related test for an abstract, puzzle-based one to continue evaluating human candidates effectively.
- Anthropic has released the original test as an open challenge, inviting engineers who can beat Claude's score to apply.
The Original Challenge: A Realistic Test
In late 2023, Anthropic needed to hire more performance engineers to manage its growing computing infrastructure. To screen a large number of applicants efficiently, Tristan Hume, a lead on the performance optimization team, designed a take-home test.
The goal was to create an engaging problem that mirrored the real work engineers do at the company. Candidates were given code to optimize for a simulated computer accelerator, a task that required understanding complex systems and careful problem-solving.
What is Performance Engineering?
Performance engineers specialize in making software and systems run faster and more efficiently. In AI, this is crucial for training and running large models, as it can save millions of dollars in computing costs and significantly speed up research and development.
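To give a concrete sense of the kind of win performance engineers chase, consider the gap between a naive loop and a vectorized one. This sketch is purely illustrative; the array size and timing approach are our own toy example, not anything from Anthropic's test:

```python
# Toy illustration of a performance optimization: summing a large array
# element by element in interpreted Python versus handing the entire
# loop to NumPy's optimized native code.
import time
import numpy as np

data = np.random.rand(10_000_000)

start = time.perf_counter()
total = 0.0
for x in data:          # interpreted loop: one Python-level operation per element
    total += x
naive = time.perf_counter() - start

start = time.perf_counter()
total = data.sum()      # vectorized: the loop runs inside compiled C code
vectorized = time.perf_counter() - start

print(f"naive: {naive:.2f}s, vectorized: {vectorized:.4f}s")
```

The two versions compute the same result, but the vectorized one is typically orders of magnitude faster. Anthropic's candidates optimized code for a simulated accelerator rather than a Python loop, but the underlying skill is the same: find where the cycles go and eliminate the waste.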
The test was initially a success. Candidates got a four-hour window to work in their own environment, a setup considered more realistic than a high-pressure live interview. The company hired dozens of engineers this way, and many candidates said they found the challenge genuinely engaging.
Over 1,000 candidates completed the initial version of the test, and it proved to be a strong predictor of on-the-job success. Some of the highest-scoring individuals were recent graduates who demonstrated exceptional skill, allowing Anthropic to hire them with confidence.
When the Student Becomes the Master
The effectiveness of the test began to wane as Anthropic's own AI models grew more powerful. The first sign of trouble came when pre-release versions of Claude 3 models started performing well on the test. By the time an early version of Claude Opus 4 was tested, it produced a more optimized solution than nearly every human applicant had managed within the time limit.
This forced the first major redesign. The team created a more advanced version of the problem, effectively starting candidates at the point where Claude Opus 4 began to struggle. The time limit was also shortened from four to two hours to streamline the hiring process.
The Tipping Point: According to Hume, by May 2025, over 50% of human candidates would have achieved a better score by simply delegating the entire test to an available version of Claude.
This second version of the test worked for several months. However, the next generation of AI, Claude Opus 4.5, proved to be another leap forward. When tested, the model not only solved the initial challenges but also identified and overcame a complex bottleneck that stumped most human candidates.
After about an hour, Claude Opus 4.5 reached the passing threshold. Within two hours, its score matched the best performance ever submitted by a human candidate.
Anthropic was at a crossroads. Its primary tool for identifying top human talent was now being mastered by the very technology those engineers were being hired to build.
Rethinking How to Test Human Skill
The team considered several options. Banning the use of AI during the test was dismissed as impractical to enforce and contrary to how engineers work today. Simply raising the performance bar was also not viable, as humans need time to read and understand the problem, while an AI could start optimizing immediately.
The core issue was that AI models, trained on vast amounts of public code and technical documentation, had become experts at solving problems that resemble real-world engineering tasks.
A New Approach: Abstract Puzzles
The solution was to create a problem that was unlike typical programming challenges. The team drew inspiration from puzzle games like Shenzhen I/O, which are known for their unusual and highly constrained programming environments.
The third and current version of the take-home test consists of a series of puzzles built on a tiny, custom instruction set. Candidates must find the most efficient solution using the fewest instructions. This abstract design makes it hard for an AI to lean on patterns from its training data.
Key features of the new test include:
- Unconventional Rules: The programming environment is intentionally strange, forcing candidates to think from first principles.
- No Debugging Tools: The test deliberately provides no visualization or debugging aids. A key part of the evaluation is seeing if candidates build their own tools to analyze the problem.
- AI-Resistant (For Now): Early trials showed that human engineers could consistently outperform Claude Opus 4.5 on these puzzles.
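To make that concrete, here is a toy sketch of the kind of highly constrained environment such puzzles might use. It is entirely hypothetical: the instruction set, registers, and scoring below are invented for illustration and are not Anthropic's actual test.

```python
# A hypothetical register machine in the spirit of puzzle games like
# Shenzhen I/O: three registers, three instructions, and a score based
# on how much work your program does.

def run(program, a=0, b=0):
    """Interpret a tiny register machine with registers A, B, C."""
    regs = {"A": a, "B": b, "C": 0}
    pc = 0          # program counter
    executed = 0    # dynamic instruction count (the "score")
    while pc < len(program):
        op, *args = program[pc]
        if op == "MOV":            # MOV dst src: copy a register or constant
            dst, src = args
            regs[dst] = regs[src] if src in regs else src
        elif op == "ADD":          # ADD dst src: dst += register or constant
            dst, src = args
            regs[dst] += regs[src] if src in regs else src
        elif op == "JNZ":          # JNZ reg offset: relative jump if reg != 0
            reg, offset = args
            if regs[reg] != 0:
                pc += offset
                executed += 1
                continue
        executed += 1
        pc += 1
    return regs, executed

# Puzzle: compute A * B in register C using only the instructions above.
# A three-instruction solution via repeated addition (assumes B >= 1):
multiply = [
    ("ADD", "C", "A"),    # C += A
    ("ADD", "B", -1),     # B -= 1
    ("JNZ", "B", -2),     # loop back while B != 0
]

regs, score = run(multiply, a=6, b=7)
print(regs["C"], "computed in", score, "executed instructions")  # 42 in 21
```

Even in a toy like this, scoring by instruction count rewards reasoning from first principles over pattern matching, which is exactly the property the new test is designed to probe. Note also that the interpreter offers no debugger; a candidate who wants visibility into execution has to build their own tracing, mirroring the test's emphasis on tool-building.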
While this new test is proving effective, it marks a significant departure from the original goal of simulating real work. To find problems that still require human ingenuity, Anthropic had to create a scenario that was intentionally unrealistic.
An Open Challenge to Engineers
In a move to engage the broader engineering community, Anthropic has publicly released the original take-home test. The company acknowledges that while AI is faster in short sprints, human experts still hold an advantage given unlimited time.
They have issued a challenge: anyone who can write code that runs faster than Claude's best effort is encouraged to apply for a role. The benchmark to beat is a score of 1487 cycles, achieved by Claude Opus 4.5 after 11.5 hours of optimization.
Performance Benchmarks (Lower is Better)
- Claude Opus 4.5 (2 hours): 1579 cycles
- Claude Opus 4.5 (11.5 hours): 1487 cycles
- Best Human Performance (Unlimited Time): Substantially faster than Claude's best score.
This ongoing race between human and artificial intelligence within Anthropic's own walls serves as a powerful illustration of the current moment in technology. As AI capabilities continue to expand, the definition of expert human skill—and how we measure it—is being reshaped in real time.