
How We Score GenAI Exposure: O*NET Tasks Meet Large Language Models

A deep dive into AIJobWatch's methodology for scoring generative AI exposure across 873 occupations, from O*NET task decomposition to LLM capability mapping.

AIJobWatch scores every occupation against generative AI exposure using a methodology we call Task-Level GenAI Capability Mapping (TGCM). This article is a complete technical disclosure of that methodology, from data sources to scoring algorithms to validation. We believe transparency is essential: if our scores influence career decisions, policy discussions, or public understanding, the methodology must be fully auditable.

Overview

TGCM works in five stages:

  1. Task decomposition: Break each occupation into its constituent tasks using O*NET data
  2. LLM capability assessment: Evaluate each task against current frontier LLM capabilities
  3. Weighting: Weight tasks by time allocation and economic importance
  4. Barrier adjustment: Adjust for structural barriers that slow displacement regardless of technical capability
  5. Aggregation: Combine into the final ADI (AI Displacement Index) score

Stage 1: Task Decomposition

Data Source: O*NET 29.0

The Occupational Information Network (O*NET), maintained by the Department of Labor, provides the most comprehensive publicly available database of occupational tasks, skills, and work activities. Version 29.0 (released September 2025) covers 873 Standard Occupational Classification (SOC) codes, representing virtually all employment in the U.S.

For each SOC code, O*NET provides:

  • Task statements: Specific activities performed by workers in the occupation (15–30 per occupation)
  • Detailed Work Activities (DWAs): Broader activity categories that map across occupations
  • Generalized Work Activities (GWAs): High-level work activity types (41 categories)
  • Task importance ratings: Worker-rated importance of each task (1–5 scale)
  • Task frequency ratings: How often each task is performed
  • Skills, abilities, and knowledge requirements

We use task statements as our primary unit of analysis because they provide the most granular view of what workers actually do. A single occupation typically has 18–25 task statements.

Example: Accountants (SOC 13-2011)

| # | Task Statement (abbreviated) | Importance | Frequency |
|---|---|---|---|
| 1 | Prepare, examine, or analyze accounting records, financial statements, or other financial reports | 4.72 | Daily |
| 2 | Compute taxes owed and prepare tax returns, ensuring compliance with payment, reporting, or other tax requirements | 4.15 | Seasonal/Periodic |
| 3 | Analyze business operations, trends, costs, revenues, financial commitments, and obligations to project future revenues and expenses | 4.38 | Weekly |
| 4 | Report to management regarding the finances of establishment | 3.95 | Weekly/Monthly |
| 5 | Establish tables of accounts and assign entries to proper accounts | 3.82 | Daily |
| 6 | Develop, maintain, and analyze budgets, preparing periodic reports that compare budgeted costs to actual costs | 4.01 | Monthly |
| 7 | Develop, implement, modify, and document recordkeeping and accounting systems | 3.68 | Quarterly |
| 8 | Advise management about issues such as resource utilization, tax strategies, and assumptions underlying budget forecasts | 4.22 | Weekly/Monthly |

(Showing 8 of 23 tasks for illustration)

Stage 2: LLM Capability Assessment

The Rubric

Each task is scored against current frontier LLM capabilities (GPT-4-class models and their specialized derivatives) using a five-point rubric:

| Score | Label | Definition | Examples |
|---|---|---|---|
| 1.0 | Fully Automatable | AI can perform the entire task end-to-end at or above average human quality with minimal oversight | Data entry, basic translation, form completion, invoice processing |
| 0.8 | Highly Automatable | AI can perform 80%+ of the task; a human is needed only for edge cases or final review | First-draft report writing, tax preparation (standard returns), code generation |
| 0.5 | Partially Automatable | AI can assist significantly (roughly 50% of the task), but human judgment is required for key decisions | Medical image pre-screening, legal research, financial analysis |
| 0.2 | Minimally Automatable | AI provides marginal assistance; the task is fundamentally human | Client counseling, team leadership, crisis management |
| 0.0 | Not Automatable | Task requires physical presence, dexterity, or human-to-human interaction that AI cannot replicate | Surgery, electrical wiring, firefighting, psychotherapy |

Scoring Process

Task-level scoring uses a three-stage process to reduce bias:

  1. Automated pre-scoring: An LLM (GPT-4o, specifically) is prompted with each task statement and the rubric, producing an initial automation score. This captures the LLM's "self-assessment" of its capabilities, which is biased toward overestimation but useful as a starting point.
  2. Expert calibration: A panel of domain experts reviews and adjusts scores for occupations within their expertise. We engage 12 domain panels covering major occupation groups. Experts typically adjust LLM self-assessments downward by 10–20%, reflecting real-world deployment challenges that LLMs underestimate.
  3. Empirical validation: Where possible, we validate scores against real-world deployment data: published case studies, employer surveys, and productivity studies showing actual AI performance on specific tasks.
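The pre-scoring step can be pictured as a prompt builder plus a strict parser for the model's reply. The sketch below is an assumption throughout: the prompt wording, the function names `build_prescore_prompt` and `parse_score`, and the reply format are all hypothetical, and no model is actually called. Only the rubric labels come from the table above.

```python
# Rubric labels copied from the five-point rubric; everything else is hypothetical.
RUBRIC = {
    1.0: "Fully Automatable",
    0.8: "Highly Automatable",
    0.5: "Partially Automatable",
    0.2: "Minimally Automatable",
    0.0: "Not Automatable",
}


def build_prescore_prompt(task_statement: str) -> str:
    """Assemble an illustrative pre-scoring prompt for one task statement."""
    rubric_lines = "\n".join(
        f"{score:.1f} = {label}" for score, label in RUBRIC.items()
    )
    return (
        "Rate how automatable this occupational task is for a frontier LLM.\n"
        f"Rubric:\n{rubric_lines}\n"
        f"Task: {task_statement}\n"
        "Reply with a single number between 0.0 and 1.0."
    )


def parse_score(reply: str) -> float:
    """Parse the model's reply and reject anything outside the 0-1 range."""
    score = float(reply.strip())
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return score


prompt = build_prescore_prompt("Establish tables of accounts")
print(prompt)
print(parse_score(" 0.85\n"))
```

Validating the reply before it enters the pipeline keeps a malformed or out-of-range answer from silently becoming a task score.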

Example Scoring: Accountants (SOC 13-2011)

| Task | LLM Pre-Score | Expert Adjustment | Final Score | Rationale |
|---|---|---|---|---|
| Prepare/examine accounting records | 0.85 | 0.80 | 0.80 | AI handles standard preparation; complex reconciliation still needs humans |
| Compute taxes, prepare returns | 0.90 | 0.80 | 0.80 | Standard returns highly automated; complex situations need judgment |
| Analyze business operations/trends | 0.60 | 0.50 | 0.50 | AI aids analysis but strategic interpretation requires context |
| Report to management on finances | 0.40 | 0.30 | 0.30 | Communication, persuasion, and relationship are key |
| Establish tables of accounts | 0.90 | 0.85 | 0.85 | Highly structured and rule-based |
| Develop/analyze budgets | 0.65 | 0.55 | 0.55 | AI drafts budgets; assumptions and negotiations require humans |
| Develop recordkeeping systems | 0.75 | 0.70 | 0.70 | System design aided by AI; implementation requires organizational knowledge |
| Advise management on resources/tax | 0.30 | 0.25 | 0.25 | Advisory requires trust, judgment, and client relationship |
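To see how the accountant example relates to the 10–20% downward adjustment quoted in the scoring process, the relative reduction from pre-score to final score can be computed directly from the table above. This is a small illustrative check, not part of the published pipeline; the helper name `relative_reduction` is hypothetical.

```python
# (task, LLM pre-score, final score) copied from the accountant table above.
accountant_scores = [
    ("Prepare/examine accounting records", 0.85, 0.80),
    ("Compute taxes, prepare returns", 0.90, 0.80),
    ("Analyze business operations/trends", 0.60, 0.50),
    ("Report to management on finances", 0.40, 0.30),
    ("Establish tables of accounts", 0.90, 0.85),
    ("Develop/analyze budgets", 0.65, 0.55),
    ("Develop recordkeeping systems", 0.75, 0.70),
    ("Advise management on resources/tax", 0.30, 0.25),
]


def relative_reduction(pre_score: float, final_score: float) -> float:
    """Fraction by which the expert panel lowered the LLM pre-score."""
    return (pre_score - final_score) / pre_score


reductions = {
    task: relative_reduction(pre, final) for task, pre, final in accountant_scores
}
mean_reduction = sum(reductions.values()) / len(reductions)

for task, reduction in reductions.items():
    print(f"{task:<40} {reduction:6.1%}")
print(f"Mean downward adjustment: {mean_reduction:.1%}")
```

Across these eight tasks the mean reduction is roughly 13%, inside the quoted range, although individual tasks fall both below and above it.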

Stage 3: Weighting

Not all tasks are equal. A task that occupies 40% of a worker's time matters more than one performed annually. We apply two weight dimensions:

3a. Time Allocation Weight

O*NET provides frequency data but not precise time allocation. We supplement with:

  • BLS American Time Use Survey (ATUS): Provides time allocation for broad occupation groups
  • O*NET task frequency ratings: Mapped to estimated time shares (Daily tasks weighted ~4x annual tasks)
  • Expert panel estimates: Domain experts provide time allocation estimates for their occupation group

Time weights are normalized to sum to 1.0 for each occupation.
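A minimal sketch of the normalization step follows. The only ratio taken from the text is that daily tasks count about four times as much as annual ones; the intermediate multipliers in `RELATIVE_TIME` and the function name `time_weights` are hypothetical placeholders, and the ATUS and expert-panel inputs are left out.

```python
# Relative time multipliers per frequency label. Only the Daily:Annual ratio
# (~4x) comes from the article; the values in between are assumptions.
RELATIVE_TIME = {
    "Daily": 4.0,
    "Weekly": 3.0,
    "Monthly": 2.0,
    "Quarterly": 1.5,
    "Annual": 1.0,
}


def time_weights(frequencies: dict[str, str]) -> dict[str, float]:
    """Map each task's frequency label to a time share that sums to 1.0."""
    raw = {task: RELATIVE_TIME[label] for task, label in frequencies.items()}
    total = sum(raw.values())
    return {task: value / total for task, value in raw.items()}


example_frequencies = {
    "Prepare/examine accounting records": "Daily",
    "Analyze business operations/trends": "Weekly",
    "Develop/analyze budgets": "Monthly",
    "Develop recordkeeping systems": "Quarterly",
}
weights = time_weights(example_frequencies)

for task, weight in weights.items():
    print(f"{task:<40} {weight:.3f}")
```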

3b. Economic Importance Weight

Some tasks contribute disproportionately to the economic value of a role. A surgeon's time in surgery is more economically important than their time on paperwork. We estimate economic importance as:

  • O*NET task importance rating (1–5) normalized to 0–1
  • Adjusted for whether the task is a "core deliverable" (what the role exists to do) or a "support activity"

Final task weight = (Time_weight × 0.6) + (Economic_importance_weight × 0.4)
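The blend can be sketched in a few lines. Two assumptions are made here and are not confirmed by the text: the 1–5 importance rating is rescaled with a simple min-max mapping, and the "core deliverable" adjustment is omitted. The function names are hypothetical.

```python
TIME_SHARE = 0.6        # weight on time allocation, from the formula above
IMPORTANCE_SHARE = 0.4  # weight on economic importance, from the formula above


def normalize_importance(rating: float) -> float:
    """Rescale a 1-5 O*NET importance rating to 0-1 (min-max; an assumption)."""
    if not 1.0 <= rating <= 5.0:
        raise ValueError(f"importance rating out of range: {rating}")
    return (rating - 1.0) / 4.0


def final_task_weight(time_weight: float, importance_rating: float) -> float:
    """Blend time share and normalized importance using the 0.6 / 0.4 split."""
    return (
        TIME_SHARE * time_weight
        + IMPORTANCE_SHARE * normalize_importance(importance_rating)
    )


# Hypothetical time share of 0.25 for a task rated 4.72 in importance.
weight = final_task_weight(time_weight=0.25, importance_rating=4.72)
print(f"{weight:.3f}")
```

Because the importance term is not itself a share, blended weights computed this way do not automatically sum to 1.0 across an occupation; whether they are renormalized afterward is not stated in the text.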

Stage 4: Barrier Adjustment

Technical automation capability doesn't translate directly to actual displacement. Structural barriers slow the process:

Barrier Categories and Adjustment Factors

| Barrier | Adjustment | Measured By | Example |
|---|---|---|---|
| Professional licensing | −5 to −15 pts | State/federal licensing requirements | CPA, Bar admission, medical license |
| Physical presence requirement | −5 to −12 pts | O*NET work context: "Physical Proximity" rating | Surgeons, electricians, police |
| Regulatory compliance | −3 to −8 pts | Industry-specific regulation requiring human accountability | Financial advisors, pharmacists |
| Trust/relationship requirement | −2 to −6 pts | O*NET work style: "Concern for Others" + "Social Orientation" | Therapists, social workers, clergy |
| Safety criticality | −3 to −10 pts | Consequence of error + O*NET "Consequence of Error" rating | Air traffic controllers, nuclear engineers |
| Legal liability | −2 to −5 pts | Professional malpractice/liability exposure | Physicians, lawyers, architects |

Barriers are cumulative but capped at −35 points total. The cap prevents occupations from scoring unrealistically low when they have multiple barriers but high technical exposure.
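The cumulative-but-capped rule amounts to a sum followed by a floor. The sketch below assumes each barrier has already been assigned a point value within its published range; the function name and the two-barrier example are hypothetical.

```python
BARRIER_CAP = -35.0  # total adjustment may not go below this, per the text


def total_barrier_adjustment(adjustments: dict[str, float]) -> float:
    """Sum per-barrier adjustments (all <= 0) and apply the -35 point cap."""
    for name, points in adjustments.items():
        if points > 0:
            raise ValueError(f"barrier '{name}' must be zero or negative")
    return max(sum(adjustments.values()), BARRIER_CAP)


# Every barrier at the largest value in its published range sums to -56.
all_barriers_at_max = {
    "Professional licensing": -15.0,
    "Physical presence requirement": -12.0,
    "Regulatory compliance": -8.0,
    "Trust/relationship requirement": -6.0,
    "Safety criticality": -10.0,
    "Legal liability": -5.0,
}
print(total_barrier_adjustment(all_barriers_at_max))

# Hypothetical occupation with two moderate barriers: the cap does not bind.
two_barriers = {"Professional licensing": -10.0, "Legal liability": -4.0}
print(total_barrier_adjustment(two_barriers))
```

Since the six maximum adjustments add up to −56 points, the cap is binding for any occupation that sits near the top of most barrier ranges.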

Stage 5: Aggregation

The final ADI score is calculated as:

Raw_TAP = Σ (task_automation_score × task_weight) × 100

Where TAP = Task Automation Potential
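As a rough illustration of the weighted sum, the sketch below computes Raw_TAP for the eight accountant tasks shown earlier. It assumes equal task weights, because the per-task weights are not listed in this article, and it renormalizes weights so they sum to 1.0, which is also an assumption. The result therefore differs from the 68.4 used in the worked example, which covers all 23 tasks with their actual weights.

```python
def raw_tap(tasks: list[tuple[float, float]]) -> float:
    """Weighted mean of automation scores, scaled to 0-100.

    `tasks` holds (task_automation_score, task_weight) pairs. Weights are
    renormalized to sum to 1.0 (an assumption, not stated in the article).
    """
    total_weight = sum(weight for _, weight in tasks)
    if total_weight <= 0:
        raise ValueError("task weights must sum to a positive value")
    weighted = sum(score * weight for score, weight in tasks)
    return 100.0 * weighted / total_weight


# Final scores for the eight accountant tasks shown earlier, equally weighted.
final_scores = [0.80, 0.80, 0.50, 0.30, 0.85, 0.55, 0.70, 0.25]
equal_weight_tasks = [(score, 1.0) for score in final_scores]

print(f"{raw_tap(equal_weight_tasks):.1f}")
```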

Then:

ADI = (Raw_TAP × 0.35) + (EAS × 0.20) + (WPI × 0.15) + (HDR × 0.15) + (BTE × 0.15)

Where:

  • Raw_TAP (35%): The task-level GenAI capability score described above
  • EAS (20%): Employer Adoption Signal, derived from job posting analysis showing AI tool integration
  • WPI (15%): Wage Pressure Index, which tracks real wage trends relative to the national median
  • HDR (15%): Historical Displacement Rate, the observed employment changes over 2020–2025
  • BTE (15%): Barrier to Entry, the barrier adjustment from Stage 4, inverted (high barriers = low score contribution)

The result is a 0–100 score where higher values indicate greater displacement risk.

Worked Example: Accountants (SOC 13-2011)

| Component | Value | Weight | Contribution |
|---|---|---|---|
| Task Automation Potential (TAP) | 68.4 | 35% | 23.9 |
| Employer Adoption Signal (EAS) | 72.0 | 20% | 14.4 |
| Wage Pressure Index (WPI) | 45.0 | 15% | 6.8 |
| Historical Displacement Rate (HDR) | 38.0 | 15% | 5.7 |
| Barrier to Entry (BTE, inverted) | 55.0 | 15% | 8.3 |

ADI = 23.9 + 14.4 + 6.8 + 5.7 + 8.3 = 59.1 → rounded to 59

Note: The published ADI for accountants is 62, slightly higher than this example because the full calculation includes all 23 tasks (not just the 8 shown) and incorporates sub-specialty weighting.
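The worked example can be reproduced with a short function that applies the five pillar weights. The component values are the ones listed for accountants above; the names `ADI_WEIGHTS` and `adi` are hypothetical.

```python
# Pillar weights from the ADI formula.
ADI_WEIGHTS = {"TAP": 0.35, "EAS": 0.20, "WPI": 0.15, "HDR": 0.15, "BTE": 0.15}


def adi(components: dict[str, float]) -> float:
    """Weighted composite of the five pillars, each on a 0-100 scale."""
    missing = ADI_WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(ADI_WEIGHTS[name] * components[name] for name in ADI_WEIGHTS)


# Component values from the accountant worked example.
accountants = {"TAP": 68.4, "EAS": 72.0, "WPI": 45.0, "HDR": 38.0, "BTE": 55.0}
score = adi(accountants)

print(f"Unrounded ADI: {score:.2f}")
print(f"Rounded ADI:   {round(score)}")
```

Working from unrounded contributions gives about 59.04 rather than the 59.1 obtained by summing the rounded table entries; both round to 59.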

Validation

We validate TGCM scores through three methods:

1. Retrospective Validation

We calculated TGCM scores using 2020 AI capabilities and compared predicted displacement to actual 2020–2025 employment changes:

| ADI Range (2020 calc) | Predicted Employment Change | Actual Employment Change (2020–2025) | Correlation |
|---|---|---|---|
| 0–20 (Low Risk) | +2% to +8% | +4.2% average | Strong |
| 21–40 (Moderate) | −2% to +3% | +1.1% average | Moderate |
| 41–60 (Elevated) | −5% to −1% | −3.8% average | Strong |
| 61–80 (High) | −12% to −5% | −8.2% average | Moderate |
| 81–100 (Very High) | −20% to −10% | −14.6% average | Moderate |

Overall Pearson correlation between ADI score and actual employment change: r = −0.72 (a strong negative correlation: higher ADI scores predict larger employment declines).
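One simple consistency check on the band-level figures is whether each band's actual average change falls inside its predicted range. The sketch below uses only the numbers in the retrospective table; it is a band-level sanity check and does not reproduce the occupation-level correlation, which needs the underlying dataset.

```python
# (ADI band, (predicted low %, predicted high %), actual average %) from the table.
bands = [
    ("0-20 (Low Risk)", (2.0, 8.0), 4.2),
    ("21-40 (Moderate)", (-2.0, 3.0), 1.1),
    ("41-60 (Elevated)", (-5.0, -1.0), -3.8),
    ("61-80 (High)", (-12.0, -5.0), -8.2),
    ("81-100 (Very High)", (-20.0, -10.0), -14.6),
]


def within_range(predicted: tuple[float, float], actual: float) -> bool:
    """True when the actual change lies inside the predicted interval."""
    low, high = predicted
    return low <= actual <= high


results = {label: within_range(predicted, actual) for label, predicted, actual in bands}

for label, ok in results.items():
    print(f"{label:<20} {'inside' if ok else 'outside'} predicted range")
```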

2. Cross-Method Validation

We compare TGCM scores to scores produced by other methodologies:

| Methodology | Authors | Correlation with TGCM | Key Differences |
|---|---|---|---|
| Occupational Exposure to AI (AIOE) | Felten, Raj, Seamans (2021, updated 2025) | r = 0.78 | AIOE focuses on AI application capabilities; TGCM on task-level LLM mapping |
| Exposure to GPT (Eloundou et al.) | OpenAI/UPenn (2023) | r = 0.82 | GPT-focused; TGCM broader (includes non-LLM AI); TGCM adds barrier adjustments |
| Automation Probability (Frey & Osborne) | Oxford Martin School (2013, updated 2024) | r = 0.65 | F&O includes physical automation; less granular on cognitive/LLM tasks |
| OECD AI Exposure Index | OECD (2024) | r = 0.71 | International scope; different task taxonomy; similar directional findings |

High cross-method correlation provides confidence that TGCM captures real signal rather than methodological artifacts.
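For readers who want to repeat a cross-method comparison on their own data, a Pearson correlation over paired occupation scores can be computed as follows. The five score pairs are made-up placeholders for illustration only and are not drawn from TGCM or any of the cited indices.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient for two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


# Hypothetical scores for five occupations under two methodologies.
method_a_scores = [62.0, 35.0, 80.0, 20.0, 55.0]
method_b_scores = [58.0, 40.0, 70.0, 30.0, 60.0]

print(f"r = {pearson_r(method_a_scores, method_b_scores):.2f}")
```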

3. Expert Panel Validation

We presented TGCM scores for 50 randomly selected occupations to 24 labor economists and workforce researchers, asking them to independently estimate displacement risk on a 0–100 scale. Average expert-TGCM correlation: r = 0.74. Notable divergences: experts tended to rate healthcare occupations lower (more optimistic about regulatory barriers) and creative occupations higher (more pessimistic about generative AI impact) than TGCM.

Known Limitations

  1. Point-in-time assessment: TGCM captures current AI capabilities. Scores must be updated as technology evolves. We recalculate quarterly.
  2. Task-level granularity ceiling: O*NET tasks are still fairly high-level. A single "task" may contain sub-tasks with different automation profiles. We mitigate this through expert panel review but cannot fully eliminate aggregation artifacts.
  3. Adoption speed uncertainty: TGCM measures technical capability to automate. Actual adoption depends on economics, regulation, cultural factors, and organizational inertia, all of which are harder to model.
  4. Complementarity effects: AI may make some tasks more valuable rather than displacing them. For example, a radiologist aided by AI may read more scans, increasing demand for radiologists. TGCM partially captures this through the barrier adjustment but may underweight complementarity in some cases.
  5. New task creation: TGCM scores existing tasks against AI. It cannot account for new tasks that AI creates (e.g., prompt engineering didn't exist in O*NET until 2024). This means TGCM may overstate net displacement by not counting new roles.
  6. Geographic variation: TGCM produces national scores. Actual displacement varies by region based on industry concentration, employer behavior, and local labor market conditions.
  7. Non-LLM AI: While TGCM focuses on generative AI / LLM capabilities, some occupations face displacement from other AI types (computer vision, robotics). We incorporate these where relevant but the methodology is LLM-centric.

Comparison to Other Approaches

How TGCM differs from prior automation risk methodologies:

| Feature | Frey & Osborne (2013) | Eloundou et al. (2023) | TGCM (AIJobWatch) |
|---|---|---|---|
| Unit of analysis | Whole occupation | Task + occupation | Task (weighted) |
| AI scope | Broad automation (including robotics) | GPT-family LLMs | Frontier LLMs + specialized AI |
| Scoring method | ML classifier on occupation features | Human + GPT-4 labeling | LLM pre-score + expert calibration + empirical validation |
| Barrier adjustment | Implicit (in ML features) | Separate exposure vs. vulnerability analysis | Explicit barrier categories with empirical weights |
| Time horizon | "Over some unspecified number of years" | Current capabilities + near-term | 10-year window (updated quarterly) |
| Multi-dimensional | Single probability | Exposure score (α, β, ζ levels) | Five-pillar composite (TAP, EAS, WPI, HDR, BTE) |
| Update frequency | Periodic (2013, 2017, 2024) | Single publication | Quarterly |

Data Access and Reproducibility

We publish the following for full reproducibility:

  • Task-level scores: Every task statement, its automation score, weight, and the final occupation ADI, available as a downloadable CSV on our Data page
  • Methodology code: The scoring algorithm (excluding proprietary expert panel adjustments) is open-source on GitHub
  • Expert panel composition: Named experts and their institutional affiliations (with consent)
  • Quarterly changelog: What changed in each quarterly update, including which occupations were re-scored and why
  • Validation data: Retrospective validation datasets and correlation analyses

How to Use ADI Scores

ADI scores are designed to be informative, not deterministic. Appropriate uses:

  • Career guidance: Comparing relative risk across occupations (e.g., "accounting has higher AI exposure than nursing"). Do not use as sole basis for career decisions.
  • Policy analysis: Identifying high-concentration geographic areas or demographic groups. Suitable for targeting workforce development resources.
  • Employer planning: Understanding which roles within an organization face highest AI exposure. Useful for retraining investment prioritization.
  • Research: As a dependent or independent variable in labor economics research. Cite the methodology version and date.

Conclusion

TGCM is our best effort to bring rigor and transparency to the question of AI displacement risk. It builds on prior work by Frey & Osborne, Eloundou et al., and the OECD while adding task-level granularity, multi-dimensional scoring, quarterly updates, and explicit barrier adjustments. Like all models, it is wrong in specifics, but we believe it is directionally useful and, critically, fully transparent. Every score can be traced back to specific tasks, specific capability assessments, and specific weights. We invite scrutiny, critique, and improvement from the research community, policymakers, and the workers whose livelihoods these scores describe.
