How We Score GenAI Exposure: O*NET Tasks Meet Large Language Models

AIJobWatch scores every occupation for generative AI exposure using a methodology we call Task-Level GenAI Capability Mapping (TGCM). This article is a complete technical disclosure of that methodology, from data sources to scoring algorithms to validation. We believe transparency is essential: if our scores influence career decisions, policy discussions, or public understanding, the methodology must be fully auditable.

Overview

TGCM works in five stages:

Task decomposition: Break each occupation into its constituent tasks using O*NET data
LLM capability assessment: Evaluate each task against current frontier LLM capabilities
Weighting: Weight tasks by time allocation and economic importance
Barrier adjustment: Adjust for structural barriers that slow displacement regardless of technical capability
Aggregation: Combine into the final ADI (AI Displacement Index) score

Stage 1: Task Decomposition

Data Source: O*NET 29.0

The Occupational Information Network (O*NET), maintained by the U.S. Department of Labor, provides the most comprehensive publicly available database of occupational tasks, skills, and work activities. Version 29.0 (released September 2025) covers 873 Standard Occupational Classification (SOC) codes, representing virtually all U.S. employment.

For each SOC code, O*NET provides:

Task statements: Specific activities performed by workers in the occupation (15–30 per occupation)
Detailed Work Activities (DWAs): Broader activity categories that map across occupations
Generalized Work Activities (GWAs): High-level work activity types (41 categories)
Task importance ratings: Worker-rated importance of each task (1–5 scale)
Task frequency ratings: How often each task is performed
Skills, abilities, and knowledge requirements

We use task statements as our primary unit of analysis because they provide the most granular view of what workers actually do. A single occupation typically has 18–25 task statements.

Example: Accountants (SOC 13-2011)

#	Task Statement (abbreviated)	Importance	Frequency
1	Prepare, examine, or analyze accounting records, financial statements, or other financial reports	4.72	Daily
2	Compute taxes owed and prepare tax returns, ensuring compliance with payment, reporting, or other tax requirements	4.15	Seasonal/Periodic
3	Analyze business operations, trends, costs, revenues, financial commitments, and obligations to project future revenues and expenses	4.38	Weekly
4	Report to management regarding the finances of establishment	3.95	Weekly/Monthly
5	Establish tables of accounts and assign entries to proper accounts	3.82	Daily
6	Develop, maintain, and analyze budgets, preparing periodic reports that compare budgeted costs to actual costs	4.01	Monthly
7	Develop, implement, modify, and document recordkeeping and accounting systems	3.68	Quarterly
8	Advise management about issues such as resource utilization, tax strategies, and assumptions underlying budget forecasts	4.22	Weekly/Monthly

(Showing 8 of 23 tasks for illustration)

Stage 2: LLM Capability Assessment

The Rubric

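Each task is scored against current frontier LLM capabilities (GPT-4-class models and their specialized derivatives) using a rubric with five anchor levels. Scores between anchors, such as the 0.85 and 0.55 in the scoring example below, are used when a task falls between two levels: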

Score	Label	Definition	Examples
1.0	Fully Automatable	AI can perform the entire task end-to-end at or above average human quality with minimal oversight	Data entry, basic translation, form completion, invoice processing
0.8	Highly Automatable	AI can perform 80%+ of the task; human needed only for edge cases or final review	First-draft report writing, tax preparation (standard returns), code generation
0.5	Partially Automatable	AI can assist significantly (roughly 50% of the task) but human judgment is required for key decisions	Medical image pre-screening, legal research, financial analysis
0.2	Minimally Automatable	AI provides marginal assistance; task is fundamentally human	Client counseling, team leadership, crisis management
0.0	Not Automatable	Task requires physical presence, dexterity, or human-to-human interaction that AI cannot replicate	Surgery, electrical wiring, firefighting, psychotherapy
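
As a rough illustration, the rubric can be encoded as a lookup from anchor score to label. The nearest-anchor rule in `nearest_label` is a hypothetical convention chosen for this sketch; the methodology does not state how an in-between score maps to a label.

```python
# Rubric anchors taken from the table above: score -> label.
RUBRIC = {
    1.0: "Fully Automatable",
    0.8: "Highly Automatable",
    0.5: "Partially Automatable",
    0.2: "Minimally Automatable",
    0.0: "Not Automatable",
}


def nearest_label(score: float) -> str:
    """Return the label of the rubric anchor closest to `score`.

    Assumption: nearest-anchor matching is a hypothetical convention
    for illustration, not part of the published methodology.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    anchor = min(RUBRIC, key=lambda a: abs(a - score))
    return RUBRIC[anchor]


for s in (0.85, 0.55, 0.25):
    print(s, "->", nearest_label(s))
```

A score exactly halfway between two anchors (for example 0.65) has no unique nearest anchor, so a production version would need an explicit tie-breaking rule.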

Scoring Process

Task-level scoring uses a three-step process to reduce bias:

Automated pre-scoring: An LLM (GPT-4o, specifically) is prompted with each task statement and the rubric, producing an initial automation score. This captures the LLM's "self-assessment" of its capabilities — biased toward overestimation, but useful as a starting point.
Expert calibration: A panel of domain experts reviews and adjusts scores for occupations within their expertise. We engage 12 domain panels covering major occupation groups. Experts typically adjust LLM self-assessments downward by 10–20%, reflecting real-world deployment challenges that LLMs underestimate.
Empirical validation: Where possible, we validate scores against real-world deployment data — published case studies, employer surveys, and productivity studies showing actual AI performance on specific tasks.

Example Scoring: Accountants (SOC 13-2011)

Task	LLM Pre-Score	Expert-Adjusted Score	Final Score	Rationale
Prepare/examine accounting records	0.85	0.80	0.80	AI handles standard preparation; complex reconciliation still needs humans
Compute taxes, prepare returns	0.90	0.80	0.80	Standard returns highly automated; complex situations need judgment
Analyze business operations/trends	0.60	0.50	0.50	AI aids analysis but strategic interpretation requires context
Report to management on finances	0.40	0.30	0.30	Communication, persuasion, and relationship are key
Establish tables of accounts	0.90	0.85	0.85	Highly structured and rule-based
Develop/analyze budgets	0.65	0.55	0.55	AI drafts budgets; assumptions and negotiations require humans
Develop recordkeeping systems	0.75	0.70	0.70	System design aided by AI; implementation requires organizational knowledge
Advise management on resources/tax	0.30	0.25	0.25	Advisory requires trust, judgment, and client relationship
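
To see how large the expert adjustments in this table are relative to the LLM pre-scores, the reduction can be computed per task. This is a minimal sketch using only the values shown above; the helper name `relative_reduction` is ours.

```python
# (task, LLM pre-score, final score) copied from the table above.
SCORES = [
    ("Prepare/examine accounting records", 0.85, 0.80),
    ("Compute taxes, prepare returns", 0.90, 0.80),
    ("Analyze business operations/trends", 0.60, 0.50),
    ("Report to management on finances", 0.40, 0.30),
    ("Establish tables of accounts", 0.90, 0.85),
    ("Develop/analyze budgets", 0.65, 0.55),
    ("Develop recordkeeping systems", 0.75, 0.70),
    ("Advise management on resources/tax", 0.30, 0.25),
]


def relative_reduction(pre_score: float, final_score: float) -> float:
    """Fraction by which the expert panel lowered the LLM pre-score."""
    return (pre_score - final_score) / pre_score


reductions = {task: relative_reduction(pre, final) for task, pre, final in SCORES}

for task, reduction in reductions.items():
    print(f"{task:<38} {reduction:6.1%}")

mean_reduction = sum(reductions.values()) / len(reductions)
print(f"{'Mean reduction':<38} {mean_reduction:6.1%}")
```

For these eight tasks the individual reductions range from about 6% to 25%, with a mean near 13%.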

Stage 3: Weighting

Not all tasks are equal. A task that occupies 40% of a worker's time matters more than one performed annually. We apply two weight dimensions:

3a. Time Allocation Weight

O*NET provides frequency data but not precise time allocation. We supplement with:

BLS American Time Use Survey (ATUS): Provides time allocation for broad occupation groups
O*NET task frequency ratings: Mapped to estimated time shares (Daily tasks weighted ~4x annual tasks)
Expert panel estimates: Domain experts provide time allocation estimates for their occupation group

Time weights are normalized to sum to 1.0 for each occupation.
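
A minimal sketch of the normalization step, assuming frequency labels are first mapped to relative multipliers. Only the roughly 4x daily-to-annual ratio comes from the text; the other multipliers in `FREQUENCY_MULTIPLIER` are placeholders, and the ATUS and expert-panel inputs are not modeled here.

```python
# Hypothetical relative time multipliers per frequency label.
# Only the daily-to-annual ratio (~4x) is stated in the article;
# the intermediate values are illustrative placeholders.
FREQUENCY_MULTIPLIER = {
    "Daily": 4.0,
    "Weekly": 3.0,
    "Monthly": 2.0,
    "Quarterly": 1.5,
    "Annual": 1.0,
}


def time_weights(frequencies: dict[str, str]) -> dict[str, float]:
    """Convert per-task frequency labels into weights that sum to 1.0."""
    raw = {task: FREQUENCY_MULTIPLIER[freq] for task, freq in frequencies.items()}
    total = sum(raw.values())
    return {task: value / total for task, value in raw.items()}


example = {
    "Prepare accounting records": "Daily",
    "Analyze business operations": "Weekly",
    "Develop and analyze budgets": "Monthly",
    "Develop recordkeeping systems": "Quarterly",
}

weights = time_weights(example)
for task, weight in weights.items():
    print(f"{task:<32} {weight:.3f}")
print("sum =", round(sum(weights.values()), 6))
```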

3b. Economic Importance Weight

Some tasks contribute disproportionately to the economic value of a role. A surgeon's time in surgery is more economically important than their time on paperwork. We estimate economic importance as:

O*NET task importance rating (1–5) normalized to 0–1
Adjusted for whether the task is a "core deliverable" (what the role exists to do) or a "support activity"

Final task weight = (Time_weight × 0.6) + (Economic_importance_weight × 0.4)
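
The blend can be sketched as follows. Two parts are assumptions on our side: the 1–5 importance rating is rescaled linearly to 0–1, and the combined weights are renormalized so they sum to 1.0 across an occupation. The "core deliverable" adjustment is omitted because its size is not specified above.

```python
TIME_SHARE = 0.6      # coefficient on the time allocation weight
ECONOMIC_SHARE = 0.4  # coefficient on the economic importance weight


def normalize_importance(rating: float) -> float:
    """Map a 1-5 O*NET importance rating to 0-1 (assumed linear rescaling)."""
    if not 1.0 <= rating <= 5.0:
        raise ValueError("importance rating must be between 1 and 5")
    return (rating - 1.0) / 4.0


def task_weight(time_weight: float, importance_rating: float) -> float:
    """Blend time and economic importance using the 0.6 / 0.4 split."""
    return (TIME_SHARE * time_weight
            + ECONOMIC_SHARE * normalize_importance(importance_rating))


def renormalize(weights: list[float]) -> list[float]:
    """Rescale weights to sum to 1.0 (an assumption; see lead-in)."""
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative inputs: hypothetical time weights, real importance ratings
# for three of the accountant tasks listed earlier.
time_w = [0.5, 0.3, 0.2]
importance = [4.72, 4.15, 3.68]

blended = [task_weight(t, r) for t, r in zip(time_w, importance)]
final = renormalize(blended)
print([round(w, 3) for w in final])
```

Without the renormalization step the blended weights would generally not sum to 1.0, because the rescaled importance ratings are per-task values rather than shares.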

Stage 4: Barrier Adjustment

Technical automation capability doesn't translate directly to actual displacement. Structural barriers slow the process:

Barrier Categories and Adjustment Factors

Barrier	Adjustment	Measured By	Example
Professional licensing	−5 to −15 pts	State/federal licensing requirements	CPA, Bar admission, medical license
Physical presence requirement	−5 to −12 pts	O*NET work context: "Physical Proximity" rating	Surgeons, electricians, police
Regulatory compliance	−3 to −8 pts	Industry-specific regulation requiring human accountability	Financial advisors, pharmacists
Trust/relationship requirement	−2 to −6 pts	O*NET work style: "Concern for Others" + "Social Orientation"	Therapists, social workers, clergy
Safety criticality	−3 to −10 pts	Consequence of error + O*NET "Consequence of Error" rating	Air traffic controllers, nuclear engineers
Legal liability	−2 to −5 pts	Professional malpractice/liability exposure	Physicians, lawyers, architects

Barriers are cumulative but capped at −35 points total. The cap prevents occupations from scoring unrealistically low when they have multiple barriers but high technical exposure.
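
The cumulative-with-cap rule can be expressed in a few lines. The point values in the example are hypothetical picks from within the ranges in the table, not published figures for any occupation.

```python
BARRIER_CAP = -35.0  # most negative total adjustment allowed, in points


def total_barrier_adjustment(adjustments: dict[str, float]) -> float:
    """Sum per-barrier adjustments (each <= 0) and apply the -35 point cap."""
    for name, points in adjustments.items():
        if points > 0:
            raise ValueError(f"adjustment for {name!r} must be zero or negative")
    return max(sum(adjustments.values()), BARRIER_CAP)


# Hypothetical occupation with several strong barriers.
heavily_regulated = {
    "Professional licensing": -15.0,
    "Physical presence requirement": -12.0,
    "Safety criticality": -10.0,
    "Legal liability": -5.0,
}

# Hypothetical occupation with two mild barriers.
lightly_regulated = {
    "Professional licensing": -5.0,
    "Regulatory compliance": -3.0,
}

print(total_barrier_adjustment(heavily_regulated))  # raw sum -42, capped
print(total_barrier_adjustment(lightly_regulated))  # raw sum -8, uncapped
```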

Stage 5: Aggregation

The final ADI score is calculated as:

Raw_TAP = Σ (task_automation_score × task_weight) × 100

Where TAP = Task Automation Potential
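
A minimal sketch of the Raw_TAP sum, using the eight final accountant scores from Stage 2. The per-task weights are not published in this article, so equal weights are used here as a placeholder assumption; real weights come from Stage 3.

```python
def raw_tap(scores: list[float], weights: list[float]) -> float:
    """Weighted sum of task automation scores, scaled to 0-100."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must sum to 1.0")
    return 100.0 * sum(s * w for s, w in zip(scores, weights))


# Final scores for the eight accountant tasks shown in Stage 2.
final_scores = [0.80, 0.80, 0.50, 0.30, 0.85, 0.55, 0.70, 0.25]

# Placeholder assumption: equal weights, since actual weights are not listed.
equal_weights = [1.0 / len(final_scores)] * len(final_scores)

tap = raw_tap(final_scores, equal_weights)
print(f"Raw_TAP with equal weights: {tap:.1f}")
```

With equal weights this gives roughly 59.4, which is lower than the TAP of 68.4 used in the worked example. The gap is expected: the published figure uses unequal weights and all 23 tasks.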

Then:

ADI = (Raw_TAP × 0.35) + (EAS × 0.20) + (WPI × 0.15) + (HDR × 0.15) + (BTE × 0.15)

Where:

Raw_TAP (35%): The task-level GenAI capability score described above
EAS (20%): Employer Adoption Signal — derived from job posting analysis showing AI tool integration
WPI (15%): Wage Pressure Index — real wage trends relative to national median
HDR (15%): Historical Displacement Rate — observed employment changes 2020–2025
BTE (15%): Barrier to Entry — the barrier adjustment from Stage 4, inverted (high barriers = low score contribution)

The result is a 0–100 score where higher values indicate greater displacement risk.

Worked Example: Accountants (SOC 13-2011)

Component	Value	Weight	Contribution
Task Automation Potential (TAP)	68.4	35%	23.9
Employer Adoption Signal (EAS)	72.0	20%	14.4
Wage Pressure Index (WPI)	45.0	15%	6.8
Historical Displacement Rate (HDR)	38.0	15%	5.7
Barrier to Entry (BTE, inverted)	55.0	15%	8.3

ADI = 23.9 + 14.4 + 6.8 + 5.7 + 8.3 = 59.1 → Rounded to 59

Note: The published ADI for accountants is 62, slightly higher than this example because the full calculation includes all 23 tasks (not just the 8 shown) and incorporates sub-specialty weighting.
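
The arithmetic in the worked example can be checked with a short script. The component values are taken from the table above; the dictionary layout and the function name `adi` are our own choices.

```python
# Pillar weights from the ADI formula.
WEIGHTS = {"TAP": 0.35, "EAS": 0.20, "WPI": 0.15, "HDR": 0.15, "BTE": 0.15}


def adi(components: dict[str, float]) -> float:
    """Weighted sum of the five pillar scores (each on a 0-100 scale)."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Component values for Accountants (SOC 13-2011) from the worked example.
accountants = {"TAP": 68.4, "EAS": 72.0, "WPI": 45.0, "HDR": 38.0, "BTE": 55.0}

score = adi(accountants)
print(f"unrounded ADI: {score:.2f}")
print(f"rounded ADI:   {round(score)}")
```

The unrounded result is 59.04. The table's 59.1 comes from summing contributions that were each rounded to one decimal place first; both figures round to 59.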

Validation

We validate TGCM scores through three methods:

1. Retrospective Validation

We calculated TGCM scores using 2020 AI capabilities and compared predicted displacement to actual 2020–2025 employment changes:

ADI Range (2020 calc)	Predicted Employment Change	Actual Employment Change (2020–2025)	Correlation
0–20 (Low Risk)	+2% to +8%	+4.2% average	Strong
21–40 (Moderate)	−2% to +3%	+1.1% average	Moderate
41–60 (Elevated)	−5% to −1%	−3.8% average	Strong
61–80 (High)	−12% to −5%	−8.2% average	Moderate
81–100 (Very High)	−20% to −10%	−14.6% average	Moderate

Overall Pearson correlation between ADI score and actual employment change: r = −0.72 (a strong negative correlation: higher ADI scores predict larger employment declines).
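
For readers who want to reproduce this kind of check, here is a sketch of Pearson's r applied to the five band-level rows in the table. Using band midpoints as the x values is our assumption, and the result is not comparable to the reported r: averaging within bands removes occupation-level variation, so the band-level correlation comes out much closer to −1.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Assumption: each ADI band is represented by its midpoint.
band_midpoints = [10.0, 30.5, 50.5, 70.5, 90.5]
# Average actual employment change (%) per band, from the table above.
actual_change = [4.2, 1.1, -3.8, -8.2, -14.6]

r = pearson_r(band_midpoints, actual_change)
print(f"band-level r = {r:.2f}")
```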

2. Cross-Method Validation

We compare TGCM scores to scores produced by other methodologies:

Methodology	Authors	Correlation with TGCM	Key Differences
Occupational Exposure to AI (AIOE)	Felten, Raj, Seamans (2021, updated 2025)	r = 0.78	AIOE focuses on AI application capabilities; TGCM on task-level LLM mapping
Exposure to GPT (Eloundou et al.)	OpenAI/UPenn (2023)	r = 0.82	GPT-focused; TGCM broader (includes non-LLM AI); TGCM adds barrier adjustments
Automation Probability (Frey & Osborne)	Oxford Martin School (2013, updated 2024)	r = 0.65	F&O includes physical automation; less granular on cognitive/LLM tasks
OECD AI Exposure Index	OECD (2024)	r = 0.71	International scope; different task taxonomy; similar directional findings

High cross-method correlation provides confidence that TGCM captures real signal rather than methodological artifacts.

3. Expert Panel Validation

We presented TGCM scores for 50 randomly selected occupations to 24 labor economists and workforce researchers, asking them to independently estimate displacement risk on a 0–100 scale. Average expert-TGCM correlation: r = 0.74. Notable divergences: experts tended to rate healthcare occupations lower (more optimistic about regulatory barriers) and creative occupations higher (more pessimistic about generative AI impact) than TGCM.

Known Limitations

Point-in-time assessment: TGCM captures current AI capabilities. Scores must be updated as technology evolves. We recalculate quarterly.
Task-level granularity ceiling: O*NET tasks are still fairly high-level. A single "task" may contain sub-tasks with different automation profiles. We mitigate this through expert panel review but cannot fully eliminate aggregation artifacts.
Adoption speed uncertainty: TGCM measures technical capability to automate. Actual adoption depends on economics, regulation, cultural factors, and organizational inertia — all of which are harder to model.
Complementarity effects: AI may make some tasks more valuable rather than displacing them — a radiologist aided by AI may read more scans, increasing demand for radiologists. TGCM partially captures this through the barrier adjustment but may underweight complementarity in some cases.
New task creation: TGCM scores existing tasks against AI. It cannot account for new tasks that AI creates (e.g., prompt engineering didn't exist in O*NET until 2024). This means TGCM may overstate net displacement by not counting new roles.
Geographic variation: TGCM produces national scores. Actual displacement varies by region based on industry concentration, employer behavior, and local labor market conditions.
Non-LLM AI: While TGCM focuses on generative AI / LLM capabilities, some occupations face displacement from other AI types (computer vision, robotics). We incorporate these where relevant but the methodology is LLM-centric.

Comparison to Other Approaches

How TGCM differs from prior automation risk methodologies:

Feature	Frey & Osborne (2013)	Eloundou et al. (2023)	TGCM (AIJobWatch)
Unit of analysis	Whole occupation	Task + occupation	Task (weighted)
AI scope	Broad automation (including robotics)	GPT-family LLMs	Frontier LLMs + specialized AI
Scoring method	ML classifier on occupation features	Human + GPT-4 labeling	LLM pre-score + expert calibration + empirical validation
Barrier adjustment	Implicit (in ML features)	Separate exposure vs. vulnerability analysis	Explicit barrier categories with empirical weights
Time horizon	"Over some unspecified number of years"	Current capabilities + near-term	10-year window (updated quarterly)
Multi-dimensional	Single probability	Exposure score (α, β, ζ levels)	Five-pillar composite (TAP, EAS, WPI, HDR, BTE)
Update frequency	Periodic (2013, 2017, 2024)	Single publication	Quarterly

Data Access and Reproducibility

We publish the following for full reproducibility:

Task-level scores: Every task statement, its automation score, weight, and the final occupation ADI — available as downloadable CSV on our Data page
Methodology code: The scoring algorithm (excluding proprietary expert panel adjustments) is open-source on GitHub
Expert panel composition: Named experts and their institutional affiliations (with consent)
Quarterly changelog: What changed in each quarterly update, including which occupations were re-scored and why
Validation data: Retrospective validation datasets and correlation analyses

How to Use ADI Scores

ADI scores are designed to be informative, not deterministic. Appropriate uses:

Career guidance: Comparing relative risk across occupations (e.g., "accounting has higher AI exposure than nursing"). Do not use as sole basis for career decisions.
Policy analysis: Identifying geographic areas or demographic groups with a high concentration of workers in high-ADI occupations. Suitable for targeting workforce development resources.
Employer planning: Understanding which roles within an organization face highest AI exposure. Useful for retraining investment prioritization.
Research: As a dependent or independent variable in labor economics research. Cite the methodology version and date.

Conclusion

TGCM is our best effort to bring rigor and transparency to the question of AI displacement risk. It builds on prior work by Frey & Osborne, Eloundou et al., and the OECD while adding task-level granularity, multi-dimensional scoring, quarterly updates, and explicit barrier adjustments. Like all models, it is wrong in specifics — but we believe it is directionally useful and, critically, fully transparent. Every score can be traced back to specific tasks, specific capability assessments, and specific weights. We invite scrutiny, critique, and improvement from the research community, policymakers, and the workers whose livelihoods these scores describe.