Should AI Be Measuring AI?

June 8, 2026

Julie Deardorff

Michelle Yin

New research: AI’s estimates of labor-market effects vary dramatically depending on which large language model generates the ratings.

Is your career ‘AI-proof’? If you’re asking the technology itself for the answer, it can depend on which model you use, according to a new Northwestern University study coauthored by School of Education and Social Policy economist Michelle Yin.

A growing body of research uses artificial intelligence to measure which jobs are most likely to be disrupted by AI. These scores are used by international organizations, consulting firms, financial institutions, and governments to forecast labor market disruption, identify vulnerable sectors, and build job training programs.

But in a new working paper published on the National Bureau of Economic Research website, Yin and her coauthors found that AI’s estimates of labor-market effects vary dramatically depending on which large language model generates the ratings.

The same occupation can appear highly vulnerable under one model and far less so under another, with important implications for labor market research, workforce planning, and public policy, the researchers found. For example, accountants are rated highly exposed to AI by Anthropic’s Claude 4.5 but receive a much lower exposure ranking from Google’s Gemini 2.5. Advertising managers and chief executives show similar disagreement across the four models tested.

“We need careful measurement before making sweeping claims about the future of work,” Yin said.

The researchers asked OpenAI’s original GPT-4, ChatGPT-5, Google DeepMind’s Gemini 2.5, and Anthropic’s Claude 4.5 which jobs would be most affected by AI and often got widely different answers. They used the same rubric, the same task descriptions from the U.S. Department of Labor, and the same data pipeline. The only thing that changed was the AI model doing the rating.

According to Yin, anywhere from 14% to 51% of the average job’s tasks could be affected — a 3.6x difference for the exact same jobs, depending on which AI tool you use to measure the risk.

“The disagreement gets even starker when you look at ‘high risk’ jobs: one model says only 3% of occupations are seriously threatened, while another says 51% are,” she said. “The models also can’t agree on which jobs are most at risk; their rankings barely correlate with each other.”
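To make the two comparisons above concrete, here is a minimal sketch in Python. It is not the authors' code. The occupation scores are invented for illustration, the function names are hypothetical, and Spearman's rank correlation is assumed here as one common way to check whether two rankings agree; the article does not say which statistic the paper uses.

```python
# Illustrative only: these exposure scores are made up, not data from the study.
scores_model_a = {
    "accountants": 0.82,
    "advertising managers": 0.40,
    "chief executives": 0.55,
    "electricians": 0.10,
    "registered nurses": 0.25,
}
scores_model_b = {
    "accountants": 0.30,
    "advertising managers": 0.60,
    "chief executives": 0.20,
    "electricians": 0.15,
    "registered nurses": 0.45,
}


def ranks(scores):
    """Rank occupations from most exposed (1) to least exposed; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {occupation: position + 1 for position, occupation in enumerate(ordered)}


def spearman_rho(scores_x, scores_y):
    """Spearman rank correlation for two score sets over the same occupations."""
    rank_x, rank_y = ranks(scores_x), ranks(scores_y)
    n = len(rank_x)
    squared_diffs = sum((rank_x[o] - rank_y[o]) ** 2 for o in rank_x)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Ratio between the highest and lowest average task shares quoted in the article.
range_ratio = 51 / 14

print(f"rank correlation: {spearman_rho(scores_model_a, scores_model_b):.2f}")
print(f"ratio of reported extremes: {range_ratio:.1f}x")
```

A value near 1 would mean two models order occupations almost identically, while a value near 0 means the rankings are close to unrelated. With these invented scores the correlation is low even though both models rate the same five occupations.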

What does ‘exposure’ mean?

The researchers noted that the AI models are not interchangeable measurement devices. “They are trained on different data, designed with different objectives, and updated continuously,” they wrote. Their disagreement does not necessarily reflect temporary statistical noise. It may instead reveal genuine uncertainty about what “AI exposure” actually means.

“Does exposure mean that some tasks can be partially automated? That most tasks can be replicated? That productivity will rise? That jobs will disappear? These are distinct questions, yet public discussions often collapse them into a single indicator.”

Yin and coauthors Hoa Vu of Northwestern’s Research and Innovation for Social and Economic Inclusion (RISEI) Lab and alumna Claudia Persico (PhD ’16) of American University recommend that any study using AI-generated exposure scores report results from multiple models. “I personally would not rely on just one measure to say, ‘Oh, I should change my job,’ or ‘I should change my kid’s major,’” Yin told the Wall Street Journal.

But even more fundamentally, researchers should reconsider whether asking a large language model to assess its own capabilities is the right measurement strategy in the first place.

“The approach is circular,” the authors wrote. “The technology being studied serves as the instrument that measures its own reach. The calibration biases we document are inherent to this design. The field cannot simply update exposure scores by re-running an old rubric with a new model without introducing systematic bias relative to the original scores.”

In a companion paper, Yin and coauthor Burhan Ogut extend the argument, showing that the choice of AI platform behind an exposure measure can shift downstream employment estimates by 42 to 93 percent.

An unexpected discovery

Yin, a labor economist, said she built her career on the conviction that the way we measure work shapes how we value workers.

When she discovered the instability in AI exposure scores, she was working with vocational rehabilitation and workforce programs in Maine and Virginia, trying to help workforce agencies design programs that respond to a changing labor market, especially for workers with disabilities.

“I care about this because I have sat across the table from workers whose career decisions depend on what researchers like me put into the world, and if those numbers are not credible, we are failing the people we are supposed to serve,” she said.

Yin is director of the Dual Master’s Degree Program in Applied Economics and Social and Economic Policy (jointly offered with The Chinese University of Hong Kong) and founding principal investigator of the Research and Innovation for Social and Economic Inclusion Lab.

Her research investigates how automation, artificial intelligence, and broadband access reshape opportunities in the labor market and intersect with social insurance and healthcare policies such as Medicare, Medicaid, and SSDI.

She was elected to the National Academy of Social Insurance in recognition of her contributions to understanding the future of work and equity in social safety net systems.

More coverage:

Companion working paper: Who Uses AI?
Workers Deserve More Honest Estimates of AI Job Risk, IZA World of Labor, May 27, 2026
When the Ruler is Made of the Thing it Measures, VoxEU (European policy publication), May 12, 2026
Why AI's view of the future of work depends on which AI you ask, Financial Times
AI Can’t Agree on Which Jobs AI Might Destroy, May 10, 2026