AIJobWatch scores every occupation against generative AI exposure using a methodology we call Task-Level GenAI Capability Mapping (TGCM). This article is a complete technical disclosure of that methodology, from data sources to scoring algorithms to validation. We believe transparency is essential: if our scores influence career decisions, policy discussions, or public understanding, the methodology must be fully auditable.
Overview
TGCM works in five stages:
- Task decomposition: Break each occupation into its constituent tasks using O*NET data
- LLM capability assessment: Evaluate each task against current frontier LLM capabilities
- Weighting: Weight tasks by time allocation and economic importance
- Barrier adjustment: Adjust for structural barriers that slow displacement regardless of technical capability
- Aggregation: Combine into the final ADI (AI Displacement Index) score
Stage 1: Task Decomposition
Data Source: O*NET 29.0
The Occupational Information Network (O*NET), maintained by the U.S. Department of Labor, provides the most comprehensive publicly available database of occupational tasks, skills, and work activities. Version 29.0 (released September 2025) covers 873 Standard Occupational Classification (SOC) codes, representing virtually all employment in the U.S.
For each SOC code, O*NET provides:
- Task statements: Specific activities performed by workers in the occupation (15–30 per occupation)
- Detailed Work Activities (DWAs): Activity statements broader than individual tasks that map across occupations
- Generalized Work Activities (GWAs): High-level work activity types (41 categories)
- Task importance ratings: Worker-rated importance of each task (1–5 scale)
- Task frequency ratings: How often each task is performed
- Skills, abilities, and knowledge requirements
We use task statements as our primary unit of analysis because they provide the most granular view of what workers actually do. A single occupation typically has 18–25 task statements.
Example: Accountants (SOC 13-2011)
| # | Task Statement (abbreviated) | Importance | Frequency |
|---|---|---|---|
| 1 | Prepare, examine, or analyze accounting records, financial statements, or other financial reports | 4.72 | Daily |
| 2 | Compute taxes owed and prepare tax returns, ensuring compliance with payment, reporting, or other tax requirements | 4.15 | Seasonal/Periodic |
| 3 | Analyze business operations, trends, costs, revenues, financial commitments, and obligations to project future revenues and expenses | 4.38 | Weekly |
| 4 | Report to management regarding the finances of establishment | 3.95 | Weekly/Monthly |
| 5 | Establish tables of accounts and assign entries to proper accounts | 3.82 | Daily |
| 6 | Develop, maintain, and analyze budgets, preparing periodic reports that compare budgeted costs to actual costs | 4.01 | Monthly |
| 7 | Develop, implement, modify, and document recordkeeping and accounting systems | 3.68 | Quarterly |
| 8 | Advise management about issues such as resource utilization, tax strategies, and assumptions underlying budget forecasts | 4.22 | Weekly/Monthly |
(Showing 8 of 23 tasks for illustration)
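One way to hold these task records in code is a small dataclass. This is a minimal sketch, not AIJobWatch's actual schema: the `Task` class and its field names are hypothetical, and the three rows are copied (with abbreviated statements) from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One O*NET task statement for an occupation (hypothetical schema)."""
    soc_code: str
    statement: str
    importance: float  # O*NET importance rating on the 1-5 scale
    frequency: str     # frequency label as shown in the table


# First three rows of the accountants example (SOC 13-2011).
ACCOUNTANT_TASKS = [
    Task("13-2011", "Prepare, examine, or analyze accounting records", 4.72, "Daily"),
    Task("13-2011", "Compute taxes owed and prepare tax returns", 4.15, "Seasonal/Periodic"),
    Task("13-2011", "Analyze business operations, trends, costs, revenues", 4.38, "Weekly"),
]

# Sort by importance to see which tasks workers rate as most central.
by_importance = sorted(ACCOUNTANT_TASKS, key=lambda t: t.importance, reverse=True)
for task in by_importance:
    print(f"{task.importance:.2f}  {task.frequency:<18} {task.statement}")
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: task statements are reference data, so downstream scoring steps should not modify them in place.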
Stage 2: LLM Capability Assessment
The Rubric
Each task is scored against current frontier LLM capabilities (GPT-4-class models and their specialized derivatives) using a five-point rubric:
| Score | Label | Definition | Examples |
|---|---|---|---|
| 1.0 | Fully Automatable | AI can perform the entire task end-to-end at or above average human quality with minimal oversight | Data entry, basic translation, form completion, invoice processing |
| 0.8 | Highly Automatable | AI can perform 80%+ of the task; human needed only for edge cases or final review | First-draft report writing, tax preparation (standard returns), code generation |
| 0.5 | Partially Automatable | AI can assist significantly (roughly 50% of the task) but human judgment is required for key decisions | Medical image pre-screening, legal research, financial analysis |
| 0.2 | Minimally Automatable | AI provides marginal assistance; task is fundamentally human | Client counseling, team leadership, crisis management |
| 0.0 | Not Automatable | Task requires physical presence, dexterity, or human-to-human interaction that AI cannot replicate | Surgery, electrical wiring, firefighting, psychotherapy |
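The rubric can be expressed as a simple lookup. The sketch below maps any score in the 0 to 1 range to the closest rubric label; matching to the nearest level is an assumption made for illustration, and `nearest_rubric_label` is a hypothetical helper rather than part of the published scoring code.

```python
# Rubric levels from the table above, as (score, label) pairs.
RUBRIC = [
    (1.0, "Fully Automatable"),
    (0.8, "Highly Automatable"),
    (0.5, "Partially Automatable"),
    (0.2, "Minimally Automatable"),
    (0.0, "Not Automatable"),
]


def nearest_rubric_label(score: float) -> str:
    """Return the label of the rubric level closest to `score`.

    Nearest-level matching is an assumption of this sketch. On an exact
    tie, the entry that appears first in RUBRIC wins.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    _, label = min(RUBRIC, key=lambda pair: abs(pair[0] - score))
    return label


print(nearest_rubric_label(0.85))
print(nearest_rubric_label(0.25))
```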
Scoring Process
Task-level scoring uses a three-stage process to reduce bias:
- Automated pre-scoring: An LLM (GPT-4o, specifically) is prompted with each task statement and the rubric, producing an initial automation score. This captures the LLM's "self-assessment" of its capabilities, which is biased toward overestimation but useful as a starting point.
- Expert calibration: A panel of domain experts reviews and adjusts scores for occupations within their expertise. We engage 12 domain panels covering major occupation groups. Experts typically adjust LLM self-assessments downward by 10–20%, reflecting real-world deployment challenges that LLMs underestimate.
- Empirical validation: Where possible, we validate scores against real-world deployment data: published case studies, employer surveys, and productivity studies showing actual AI performance on specific tasks.
Example Scoring: Accountants (SOC 13-2011)
| Task | LLM Pre-Score | Expert-Adjusted Score | Final Score | Rationale |
|---|---|---|---|---|
| Prepare/examine accounting records | 0.85 | 0.80 | 0.80 | AI handles standard preparation; complex reconciliation still needs humans |
| Compute taxes, prepare returns | 0.90 | 0.80 | 0.80 | Standard returns highly automated; complex situations need judgment |
| Analyze business operations/trends | 0.60 | 0.50 | 0.50 | AI aids analysis but strategic interpretation requires context |
| Report to management on finances | 0.40 | 0.30 | 0.30 | Communication, persuasion, and relationship are key |
| Establish tables of accounts | 0.90 | 0.85 | 0.85 | Highly structured and rule-based |
| Develop/analyze budgets | 0.65 | 0.55 | 0.55 | AI drafts budgets; assumptions and negotiations require humans |
| Develop recordkeeping systems | 0.75 | 0.70 | 0.70 | System design aided by AI; implementation requires organizational knowledge |
| Advise management on resources/tax | 0.30 | 0.25 | 0.25 | Advisory requires trust, judgment, and client relationship |
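To see how large the expert corrections are in relative terms, the pre-scores and final scores from the table can be compared directly. The snippet below is an illustrative calculation on those eight rows only; `relative_reduction` is a hypothetical helper name.

```python
# (task, LLM pre-score, final score) copied from the table above.
SCORES = [
    ("Prepare/examine accounting records", 0.85, 0.80),
    ("Compute taxes, prepare returns", 0.90, 0.80),
    ("Analyze business operations/trends", 0.60, 0.50),
    ("Report to management on finances", 0.40, 0.30),
    ("Establish tables of accounts", 0.90, 0.85),
    ("Develop/analyze budgets", 0.65, 0.55),
    ("Develop recordkeeping systems", 0.75, 0.70),
    ("Advise management on resources/tax", 0.30, 0.25),
]


def relative_reduction(pre_score: float, final_score: float) -> float:
    """Fraction by which the expert panel lowered the LLM pre-score."""
    return (pre_score - final_score) / pre_score


reductions = [relative_reduction(pre, final) for _, pre, final in SCORES]
mean_reduction = sum(reductions) / len(reductions)

for (task, pre, final), cut in zip(SCORES, reductions):
    print(f"{task:<38} {pre:.2f} -> {final:.2f}  ({cut:.1%} lower)")
print(f"Mean relative reduction: {mean_reduction:.1%}")
```

For these eight rows the mean reduction is about 13%, inside the 10–20% range described in the scoring process, although individual tasks range from roughly 6% to 25%.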
Stage 3: Weighting
Not all tasks are equal. A task that occupies 40% of a worker's time matters more than one performed annually. We apply two weight dimensions:
3a. Time Allocation Weight
O*NET provides frequency data but not precise time allocation. We supplement with:
- BLS American Time Use Survey (ATUS): Provides time allocation for broad occupation groups
- O*NET task frequency ratings: Mapped to estimated time shares (Daily tasks weighted ~4x annual tasks)
- Expert panel estimates: Domain experts provide time allocation estimates for their occupation group
Time weights are normalized to sum to 1.0 for each occupation.
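The normalization step can be sketched as follows. Only the roughly 4:1 ratio between daily and annual tasks comes from the text; the intermediate values in `FREQUENCY_WEIGHT` are placeholders chosen for illustration, and the sketch ignores the ATUS and expert-panel inputs.

```python
# Relative weight per frequency label. Only the Daily:Annual ratio of
# roughly 4:1 comes from the article; the intermediate values are
# placeholders chosen for illustration.
FREQUENCY_WEIGHT = {
    "Daily": 4.0,
    "Weekly": 3.0,
    "Monthly": 2.0,
    "Quarterly": 1.5,
    "Annual": 1.0,
}


def time_weights(frequencies: list[str]) -> list[float]:
    """Map frequency labels to time shares that sum to 1.0."""
    raw = [FREQUENCY_WEIGHT[label] for label in frequencies]
    total = sum(raw)
    return [value / total for value in raw]


weights = time_weights(["Daily", "Weekly", "Monthly", "Annual"])
print([round(w, 3) for w in weights])
```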
3b. Economic Importance Weight
Some tasks contribute disproportionately to the economic value of a role. A surgeon's time in surgery is more economically important than their time on paperwork. We estimate economic importance as:
- O*NET task importance rating (1–5) normalized to 0–1
- Adjusted for whether the task is a "core deliverable" (what the role exists to do) or a "support activity"
Final task weight = (Time_weight × 0.6) + (Economic_importance_weight × 0.4)
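A minimal sketch of the weight blend follows. It assumes min-max scaling for the importance rating, since the text does not specify how the 1–5 rating is normalized, and it omits the "core deliverable" adjustment, whose size is not stated. Both function names are hypothetical.

```python
def normalize_importance(rating: float) -> float:
    """Rescale an O*NET importance rating from 1-5 to 0-1.

    Min-max scaling is an assumption; the article only says the rating
    is normalized to the 0-1 range.
    """
    if not 1.0 <= rating <= 5.0:
        raise ValueError(f"rating must be in [1, 5], got {rating}")
    return (rating - 1.0) / 4.0


def final_task_weight(time_weight: float, importance_weight: float) -> float:
    """Blend the two weights 60/40, as in the formula above."""
    return 0.6 * time_weight + 0.4 * importance_weight


# Hypothetical task: 25% of working time, importance rating 4.72.
weight = final_task_weight(0.25, normalize_importance(4.72))
print(round(weight, 3))
```

Weights built this way do not necessarily sum to 1 across an occupation's tasks, because only the time component is normalized; a final rescaling step may be needed before aggregation.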
Stage 4: Barrier Adjustment
Technical automation capability doesn't translate directly to actual displacement. Structural barriers slow the process:
Barrier Categories and Adjustment Factors
| Barrier | Adjustment | Measured By | Example |
|---|---|---|---|
| Professional licensing | −5 to −15 pts | State/federal licensing requirements | CPA, Bar admission, medical license |
| Physical presence requirement | −5 to −12 pts | O*NET work context: "Physical Proximity" rating | Surgeons, electricians, police |
| Regulatory compliance | −3 to −8 pts | Industry-specific regulation requiring human accountability | Financial advisors, pharmacists |
| Trust/relationship requirement | −2 to −6 pts | O*NET work style: "Concern for Others" + "Social Orientation" | Therapists, social workers, clergy |
| Safety criticality | −3 to −10 pts | Consequence of error + O*NET "Consequence of Error" rating | Air traffic controllers, nuclear engineers |
| Legal liability | −2 to −5 pts | Professional malpractice/liability exposure | Physicians, lawyers, architects |
Barriers are cumulative but capped at −35 points total. The cap prevents occupations from scoring unrealistically low when they have multiple barriers but high technical exposure.
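The cumulative-with-cap rule can be sketched in a few lines. The point ranges come from the table above; the dictionary keys, the range check, and the example occupation are hypothetical and only illustrate how the cap behaves.

```python
BARRIER_CAP = -35.0  # total adjustment cannot go below this

# Allowed (mildest, strongest) adjustment in points, per the table above.
BARRIER_RANGES = {
    "professional_licensing": (-5.0, -15.0),
    "physical_presence": (-5.0, -12.0),
    "regulatory_compliance": (-3.0, -8.0),
    "trust_relationship": (-2.0, -6.0),
    "safety_criticality": (-3.0, -10.0),
    "legal_liability": (-2.0, -5.0),
}


def total_barrier_adjustment(adjustments: dict[str, float]) -> float:
    """Sum per-barrier adjustments and apply the -35 point cap."""
    for name, points in adjustments.items():
        mildest, strongest = BARRIER_RANGES[name]
        if not strongest <= points <= mildest:
            raise ValueError(f"{name}: {points} outside [{strongest}, {mildest}]")
    return max(sum(adjustments.values()), BARRIER_CAP)


# Hypothetical occupation hitting the maximum on four barriers.
example = {
    "professional_licensing": -15.0,
    "physical_presence": -12.0,
    "safety_criticality": -10.0,
    "legal_liability": -5.0,
}
print(total_barrier_adjustment(example))  # uncapped sum is -42
```

Because the six maximum adjustments add up to −56 points, the cap only binds for occupations that face several strong barriers at once.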
Stage 5: Aggregation
The final ADI score is calculated as:
Raw_TAP = Σ (task_automation_score × task_weight) × 100
Where TAP = Task Automation Potential
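As a rough illustration, the weighted sum can be computed as below. The function rescales weights so they sum to 1, which is an assumption of this sketch made to keep the result on a 0–100 scale. The equal weights in the example are placeholders, since per-task weights are not listed in this article.

```python
def raw_tap(scores: list[float], weights: list[float]) -> float:
    """Weighted task automation score on a 0-100 scale.

    Weights are rescaled to sum to 1 before use; that rescaling is an
    assumption of this sketch, made so the result stays within 0-100.
    """
    if len(scores) != len(weights) or not scores:
        raise ValueError("scores and weights must be non-empty and equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return 100.0 * sum(s * w / total for s, w in zip(scores, weights))


# Final scores for the eight accountant tasks shown earlier, with
# equal weights as a placeholder (the real task weights are not listed).
final_scores = [0.80, 0.80, 0.50, 0.30, 0.85, 0.55, 0.70, 0.25]
print(round(raw_tap(final_scores, [1.0] * len(final_scores)), 1))
```

With equal placeholder weights and only eight tasks, the result is not expected to match the TAP value of 68.4 used in the worked example.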
Then:
ADI = (Raw_TAP × 0.35) + (EAS × 0.20) + (WPI × 0.15) + (HDR × 0.15) + (BTE × 0.15)
Where:
- Raw_TAP (35%): The task-level GenAI capability score described above
- EAS (20%): Employer Adoption Signal β derived from job posting analysis showing AI tool integration
- WPI (15%): Wage Pressure Index β real wage trends relative to national median
- HDR (15%): Historical Displacement Rate, based on observed employment changes 2020–2025
- BTE (15%): Barrier to Entry, the barrier adjustment from Stage 4, inverted (high barriers = low score contribution)
The result is a 0–100 score where higher values indicate greater displacement risk.
Worked Example: Accountants (SOC 13-2011)
| Component | Value | Weight | Contribution |
|---|---|---|---|
| Task Automation Potential (TAP) | 68.4 | 35% | 23.9 |
| Employer Adoption Signal (EAS) | 72.0 | 20% | 14.4 |
| Wage Pressure Index (WPI) | 45.0 | 15% | 6.8 |
| Historical Displacement Rate (HDR) | 38.0 | 15% | 5.7 |
| Barrier to Entry (BTE, inverted) | 55.0 | 15% | 8.3 |
ADI = 23.9 + 14.4 + 6.8 + 5.7 + 8.3 = 59.1 → rounded to 59
Note: The published ADI for accountants is 62, slightly higher than this example because the full calculation includes all 23 tasks (not just the 8 shown) and incorporates sub-specialty weighting.
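The arithmetic in the worked example can be reproduced with a short function. This is a sketch of the weighted sum only; the `adi` function and the dictionary layout are illustrative choices, while the weights and component values are taken from the formula and table above.

```python
# Component weights from the ADI formula.
ADI_WEIGHTS = {"TAP": 0.35, "EAS": 0.20, "WPI": 0.15, "HDR": 0.15, "BTE": 0.15}


def adi(components: dict[str, float]) -> float:
    """Weighted sum of the five components, each on a 0-100 scale."""
    missing = ADI_WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(ADI_WEIGHTS[name] * components[name] for name in ADI_WEIGHTS)


# Component values from the accountants worked example.
accountants = {"TAP": 68.4, "EAS": 72.0, "WPI": 45.0, "HDR": 38.0, "BTE": 55.0}
score = adi(accountants)
print(f"{score:.2f} -> {round(score)}")
```

Summing the unrounded contributions gives 59.04 rather than the 59.1 obtained by adding the rounded table values; both round to 59.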
Validation
We validate TGCM scores through three methods:
1. Retrospective Validation
We calculated TGCM scores using 2020 AI capabilities and compared predicted displacement to actual 2020–2025 employment changes:
| ADI Range (2020 calc) | Predicted Employment Change | Actual Employment Change (2020–2025) | Correlation |
|---|---|---|---|
| 0–20 (Low Risk) | +2% to +8% | +4.2% average | Strong |
| 21–40 (Moderate) | −2% to +3% | +1.1% average | Moderate |
| 41–60 (Elevated) | −5% to −1% | −3.8% average | Strong |
| 61–80 (High) | −12% to −5% | −8.2% average | Moderate |
| 81–100 (Very High) | −20% to −10% | −14.6% average | Moderate |
Overall Pearson correlation between ADI score and actual employment change: r = −0.72 (a strong negative correlation: higher ADI scores predict larger employment declines).
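The band-level figures in the table allow two quick consistency checks, sketched below. This is not a reproduction of the published correlation, which is computed across individual occupations. Averaging within five bands removes most of the noise, so a correlation computed on band midpoints will look much stronger than the occupation-level figure. The `pearson` helper is written out by hand for illustration.

```python
import math

# (ADI band, predicted range in %, actual average change in %) from the table.
BANDS = [
    ((0, 20), (2.0, 8.0), 4.2),
    ((21, 40), (-2.0, 3.0), 1.1),
    ((41, 60), (-5.0, -1.0), -3.8),
    ((61, 80), (-12.0, -5.0), -8.2),
    ((81, 100), (-20.0, -10.0), -14.6),
]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Check 1: does each band's actual average fall inside its predicted range?
within_range = [low <= actual <= high for _, (low, high), actual in BANDS]

# Check 2: correlation between ADI band midpoints and actual change.
midpoints = [(lo + hi) / 2 for (lo, hi), _, _ in BANDS]
actuals = [actual for _, _, actual in BANDS]
band_r = pearson(midpoints, actuals)

print(within_range, round(band_r, 2))
```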
2. Cross-Method Validation
We compare TGCM scores to scores produced by other methodologies:
| Methodology | Authors | Correlation with TGCM | Key Differences |
|---|---|---|---|
| Occupational Exposure to AI (AIOE) | Felten, Raj, Seamans (2021, updated 2025) | r = 0.78 | AIOE focuses on AI application capabilities; TGCM on task-level LLM mapping |
| Exposure to GPT (Eloundou et al.) | OpenAI/UPenn (2023) | r = 0.82 | GPT-focused; TGCM broader (includes non-LLM AI); TGCM adds barrier adjustments |
| Automation Probability (Frey & Osborne) | Oxford Martin School (2013, updated 2024) | r = 0.65 | F&O includes physical automation; less granular on cognitive/LLM tasks |
| OECD AI Exposure Index | OECD (2024) | r = 0.71 | International scope; different task taxonomy; similar directional findings |
High cross-method correlation provides confidence that TGCM captures real signal rather than methodological artifacts.
3. Expert Panel Validation
We presented TGCM scores for 50 randomly selected occupations to 24 labor economists and workforce researchers, asking them to independently estimate displacement risk on a 0–100 scale. Average expert-TGCM correlation: r = 0.74. Notable divergences: experts tended to rate healthcare occupations lower (more optimistic about regulatory barriers) and creative occupations higher (more pessimistic about generative AI impact) than TGCM.
Known Limitations
- Point-in-time assessment: TGCM captures current AI capabilities. Scores must be updated as technology evolves. We recalculate quarterly.
- Task-level granularity ceiling: O*NET tasks are still fairly high-level. A single "task" may contain sub-tasks with different automation profiles. We mitigate this through expert panel review but cannot fully eliminate aggregation artifacts.
- Adoption speed uncertainty: TGCM measures technical capability to automate. Actual adoption depends on economics, regulation, cultural factors, and organizational inertia, all of which are harder to model.
- Complementarity effects: AI may make some tasks more valuable rather than displacing them. For example, a radiologist aided by AI may read more scans, increasing demand for radiologists. TGCM partially captures this through the barrier adjustment but may underweight complementarity in some cases.
- New task creation: TGCM scores existing tasks against AI. It cannot account for new tasks that AI creates (e.g., prompt engineering didn't exist in O*NET until 2024). This means TGCM may overstate net displacement by not counting new roles.
- Geographic variation: TGCM produces national scores. Actual displacement varies by region based on industry concentration, employer behavior, and local labor market conditions.
- Non-LLM AI: While TGCM focuses on generative AI / LLM capabilities, some occupations face displacement from other AI types (computer vision, robotics). We incorporate these where relevant but the methodology is LLM-centric.
Comparison to Other Approaches
How TGCM differs from prior automation risk methodologies:
| Feature | Frey & Osborne (2013) | Eloundou et al. (2023) | TGCM (AIJobWatch) |
|---|---|---|---|
| Unit of analysis | Whole occupation | Task + occupation | Task (weighted) |
| AI scope | Broad automation (including robotics) | GPT-family LLMs | Frontier LLMs + specialized AI |
| Scoring method | ML classifier on occupation features | Human + GPT-4 labeling | LLM pre-score + expert calibration + empirical validation |
| Barrier adjustment | Implicit (in ML features) | Separate exposure vs. vulnerability analysis | Explicit barrier categories with empirical weights |
| Time horizon | "Over some unspecified number of years" | Current capabilities + near-term | 10-year window (updated quarterly) |
| Multi-dimensional | Single probability | Exposure score (α, β, ζ levels) | Five-pillar composite (TAP, EAS, WPI, HDR, BTE) |
| Update frequency | Periodic (2013, 2017, 2024) | Single publication | Quarterly |
Data Access and Reproducibility
We publish the following for full reproducibility:
- Task-level scores: Every task statement, its automation score, weight, and the final occupation ADI, available as a downloadable CSV on our Data page
- Methodology code: The scoring algorithm (excluding proprietary expert panel adjustments) is open-source on GitHub
- Expert panel composition: Named experts and their institutional affiliations (with consent)
- Quarterly changelog: What changed in each quarterly update, including which occupations were re-scored and why
- Validation data: Retrospective validation datasets and correlation analyses
How to Use ADI Scores
ADI scores are designed to be informative, not deterministic. Appropriate uses:
- Career guidance: Comparing relative risk across occupations (e.g., "accounting has higher AI exposure than nursing"). Do not use as sole basis for career decisions.
- Policy analysis: Identifying geographic areas or demographic groups with high concentrations of exposed occupations. Suitable for targeting workforce development resources.
- Employer planning: Understanding which roles within an organization face highest AI exposure. Useful for retraining investment prioritization.
- Research: As a dependent or independent variable in labor economics research. Cite the methodology version and date.
Conclusion
TGCM is our best effort to bring rigor and transparency to the question of AI displacement risk. It builds on prior work by Frey & Osborne, Eloundou et al., and the OECD while adding task-level granularity, multi-dimensional scoring, quarterly updates, and explicit barrier adjustments. Like all models, it is wrong in specifics, but we believe it is directionally useful and, critically, fully transparent. Every score can be traced back to specific tasks, specific capability assessments, and specific weights. We invite scrutiny, critique, and improvement from the research community, policymakers, and the workers whose livelihoods these scores describe.