The assigned tasks mirror those routinely handled by legal professionals, physicians, financial analysts, and management consultants. One task, for instance, requires diagnosing a six-year-old patient from nine pieces of multimedia evidence; another asks for legal counsel concerning a musician’s estate; a third involves valuing a business segment of a healthcare technology firm.
Mercor, which says it supplies “expert data” to every leading AI firm, reportedly spent more than $500,000 to create 200 tasks designed to evaluate how well AI can perform “knowledge work with high economic value” in law, medicine, finance, and management consulting. The result, the AI Productivity Index (APEX), was released on Wednesday and counts a former global managing director of McKinsey, a former dean of Harvard Business School, and a Harvard Law School professor among its co-authors; Mercor says they advised on the design and scope of tasks in their fields. Brendan Foody, Mercor’s 22-year-old CEO, says APEX aims for real depth: “How do we get very comprehensive about what it means to be a consultant or a banker or a doctor or lawyer?”
To create the tasks, Mercor recruited white-collar professionals who had previously worked at prominent institutions: banks such as Goldman Sachs and JPMorgan, consulting firms including McKinsey and Boston Consulting Group, the law firm Latham & Watkins, and hospitals such as Mount Sinai. These professionals have an average of 7.25 years of experience, and Mercor pays them on par with what they earned at their prestigious former employers. Mercor’s website advertises an average hourly rate of $81, rising to more than $200 per hour (roughly $400,000 a year at full-time hours) for “Senior Domain Experts,” a role requiring at least four years of professional experience.
“It’s hard to imagine a better hourly job from a pay perspective,” says Matt Seck, a former investment banking analyst at Bank of America who is contracted by Mercor to create finance tasks similar to those in the benchmark.
While benchmarks have traditionally been employed to gauge AI proficiency, directly measuring AI models’ capacity for economically valuable work signifies a “paradigm shift,” according to Osvald Nitski, one of the paper’s authors. Nitski explains that achieving 100% on Mercor’s benchmark would essentially mean possessing “an analyst or an associate in a box that you could go and send tasks to, and then they deliver it to the requirements of a partner, or an MD, or whoever would be grading the work of that person.”
Today’s models are not there yet, but they are improving fast. OpenAI’s GPT-4o, released in May 2024, scored 35.9% on the benchmark. GPT-5, released just over a year later, scored 64.2%, the highest mark to date. That does not mean GPT-5 delivers 64.2% of a human worker’s value; the paper’s authors note that work falling short of 100% “might be effectively useless.” GPT-5 earned perfect scores on only two of the 200 tasks, one legal and one investment banking, which Mercor says “primarily involve basic reasoning, simple calculations, and a lot of basic information searching.”
Even a model that scored 100% on Mercor’s benchmark would likely be a poor substitute for a human professional. The benchmark’s tasks focus on “well scoped deliverables,” such as making a diagnosis or building a financial model, rather than more ambiguous work that could have many correct answers; task descriptions must therefore bake in numerous assumptions so that the desired output is precisely defined. The AI outputs are text only, meaning the benchmark does not test whether a model can operate a computer the way a human worker would. (Mercor says future versions of APEX will address these limitations.) And writing the lengthy prompts the models need to complete the tasks “would be more tedious than just doing it yourself,” Seck says.
Still, there are signs that AI models are becoming increasingly competitive with humans. A separate benchmark, released by OpenAI on Thursday, September 25, found that expert human graders preferred AI-generated work to human work 47.6% of the time across 220 tasks, which included designing a property sales brochure and evaluating images of skin lesions. OpenAI also found that its models improved markedly in a short period, more than doubling their “win rate” against humans between June 2024 and September 2025.
As model capabilities have advanced, so has the complexity of the tasks they are tested on, and with it the human expertise required to build sufficiently challenging evaluations. Earlier tests measured relatively abstract capabilities with simpler questions. Prior to the 2022 launch of ChatGPT, data was frequently sourced from crowdworker services, which paid workers low hourly rates. By 2023, Ph.D. students were being recruited to write challenging multiple-choice questions in biology, physics, and chemistry. In September, xAI laid off 500 of its “generalist” data workers as part of an “expansion and prioritization” of the company’s “specialist” data workers. While low-paid data workers still contribute to the development of AI models, the level of skill, and pay, required to build AI benchmarks is rising fast.
Directly measuring how useful AI models are for economically valuable tasks is “very hard to pull off,” Nitski says. Success criteria in fields like finance and consulting are harder to define than in, say, software engineering. And even with ideal criteria, grading AI outputs at scale is harder than in software engineering, where automated tests can verify that code works. That helps explain why tests of AI models’ real-world usefulness have existed for software engineering since at least 2023 but have been slower to emerge in other white-collar domains. As AIs have advanced, though, they have helped solve the problem of grading complex tasks: human experts write the success criteria for Mercor’s tasks, but the grading itself is done by AIs, which Mercor reports agreed with human graders 89% of the time, making assessment scalable.
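To make the idea concrete, here is a minimal sketch of what rubric-based AI grading can look like: a judge model checks a submission against each expert-written criterion, and the task score is the fraction of criteria met. This illustrates the general technique only, not Mercor’s actual pipeline; the function, prompts, and choice of judge model (here OpenAI’s chat API with GPT-4o) are assumptions for illustration.

```python
import json
from openai import OpenAI  # standard OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_submission(task_prompt: str, submission: str, rubric: list[str],
                     judge_model: str = "gpt-4o") -> float:
    """Illustrative rubric grading: each criterion is judged pass/fail by an
    LLM, and the score is the fraction of criteria met."""
    passed = 0
    for criterion in rubric:
        judgment = client.chat.completions.create(
            model=judge_model,
            messages=[
                {"role": "system",
                 "content": ("You are grading a professional work product. "
                             "Reply with a JSON object: {\"met\": true or false}.")},
                {"role": "user",
                 "content": (f"Task:\n{task_prompt}\n\n"
                             f"Submission:\n{submission}\n\n"
                             f"Criterion:\n{criterion}\n\n"
                             "Is the criterion met?")},
            ],
            response_format={"type": "json_object"},
        )
        if json.loads(judgment.choices[0].message.content).get("met"):
            passed += 1
    return passed / len(rubric)

# Example: a two-criterion rubric yields a score of 0.0, 0.5, or 1.0.
# score = grade_submission(task, model_answer,
#                          ["Cites the correct statute", "States the filing deadline"])
```

In practice, the harder part is the one the article describes: writing expert criteria precise enough that an automated judge and a human grader agree most of the time.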
Building benchmarks is about more than understanding what models can do. In AI, as in business, “what gets measured gets done”: effective tests tend to accelerate AI progress on whatever they evaluate. “It’s ultimately the same data type for both evaluation and training,” Foody explains. Measuring performance at games like Go is straightforward; AI was beating masters by 2016. In 2023, benchmarks emerged for evaluating AIs on real-world software engineering tasks. Two years later, the employment outlook for junior programmers looks dubious.
Foody remarked, “AI got its Ph.D. Now it’s starting to enter the job market.”