Cleaning Data with AI
AI demos use clean data. Real climate data doesn't look like that. Emissions inventories with different base years. Finance spreadsheets where "million USD" sometimes means 2020 prices and sometimes means current prices, unlabeled. The skill isn't getting AI to clean data — it's knowing what it might break without telling you.
The gap between demos and reality
A 2018 paper in Nature by Resplandy et al. claimed the oceans had warmed 60% more than previously thought — a finding that made headlines worldwide. Within weeks, an independent researcher found that the uncertainty calculations in the data processing step were wrong. The raw ocean measurements were fine; the cleaning and processing introduced the error. The paper was retracted. The outputs had looked entirely credible.
AI will process flawed data without flagging issues. It will produce charts, calculate averages, and identify trends from bad input. The outputs will look credible. This is why data cleaning comes before analysis, not after — and why you are always the editor reviewing AI's work, not the recipient of it.
The right workflow: have AI describe the problems before it fixes anything. "Before we clean anything: tell me every data quality issue you can see. Don't fix anything yet — just report."
Common problems in climate data
Unit inconsistencies
MtCO₂e vs. ktCO₂e vs. tCO₂e — factors of 1,000 that will silently destroy any comparison. Current vs. constant prices in finance data, where the $100 billion climate finance goal was set in 2009 dollars and different reports convert it differently. Currency conversions without specified dates.
A particularly live issue: CO₂ vs. CO₂e, and which Global Warming Potential multiplier was used for methane. Methane has a GWP of 28 over 100 years (GWP-100) but 80+ over 20 years (GWP-20). Using GWP-100 vs. GWP-20 changes total global emissions figures by roughly 49%. This is an unresolved methodological debate — IPCC AR6 presents both values — and AI will pick one without telling you which.
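The size of the choice is easy to see in a few lines. This is a minimal sketch with a hypothetical methane figure; the GWP multipliers are the commonly cited IPCC values, but always check which one your source inventory used:

```python
# Illustrative CH4-to-CO2e conversion under two GWP horizons.
# The multipliers below are commonly cited IPCC values; the 10 Mt
# methane figure is hypothetical.
GWP_100 = 28   # ~100-year horizon
GWP_20 = 82    # ~20-year horizon

methane_mt = 10.0  # Mt CH4

co2e_100 = methane_mt * GWP_100
co2e_20 = methane_mt * GWP_20

print(co2e_100)  # 280.0 Mt CO2e
print(co2e_20)   # 820.0 Mt CO2e -- nearly 3x higher from the same data
```

Same raw measurement, same units, wildly different totals. If a dataset doesn't state its GWP horizon, ask before comparing it with anything.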
Missing values that become zeros
A country that didn't report adaptation finance is not the same as a country with zero adaptation finance. When blank cells become zeros, "unknown" becomes "nothing," and that distorts every comparison. The UNEP Adaptation Gap Report notes this explicitly as a measurement problem — the gap between reported and actual spending is partly a gap between reporting and not reporting.
Many least-developed countries haven't filed UNFCCC emissions inventories in years. When AI calculates "average emissions per country," it silently excludes dozens of nations that never reported. Your "global average" becomes a "reporting-countries-only average" — and the countries excluded are disproportionately the most climate-vulnerable.
The fix: before any aggregation, ask explicitly: "Are there missing values in these columns? How are you handling them?" And instruct AI: "When you encounter blank or null values, flag them — don't fill, don't drop. Tell me where they are. I'll decide what to do."
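The distortion is concrete. A minimal sketch with hypothetical adaptation-finance figures, where None means "did not report":

```python
from statistics import mean

# Hypothetical reports: None = "did not report", which is not
# the same as reporting zero.
reported = {"AAA": 120.0, "BBB": None, "CCC": 45.0, "DDD": None}

# Wrong: silently treating blanks as zeros drags the average down.
as_zero = mean(v if v is not None else 0.0 for v in reported.values())

# Better: flag the gaps and average only actual reports.
missing = [c for c, v in reported.items() if v is None]
actual = mean(v for v in reported.values() if v is not None)

print(as_zero)   # 41.25 -- "unknown" counted as "nothing"
print(actual)    # 82.5  -- reporting countries only
print(missing)   # ['BBB', 'DDD'] -- the decision you still have to make
```

Neither number is "the" average. The point is that you chose one knowingly, and you know which countries are absent.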
Silent row deletion
AI drops rows with problems to make calculations work — and doesn't announce it. "Average disbursement across all countries" might quietly exclude six countries with missing values. Your average is now for 14 countries, not 20. The output looks clean. The methodology is broken.
The check: always ask for row counts before and after any cleaning operation. "How many rows did we start with? How many now? What happened to the difference?" If you started with 1,800 rows and 20 countries and ended with 1,756 rows and 19 countries, you need to know what was lost and why.
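The check is cheap to automate. A sketch with hypothetical rows, counting both rows and countries before and after a drop:

```python
# Hypothetical (country, year, disbursement) rows; None = missing value.
rows = [
    ("KEN", 2021, 12.5),
    ("KEN", 2022, None),
    ("BGD", 2021, 30.0),
    ("FJI", 2021, None),
]

before_rows = len(rows)
before_countries = {r[0] for r in rows}

# The kind of silent drop AI performs to "make the calculation work".
cleaned = [r for r in rows if r[2] is not None]

after_rows = len(cleaned)
after_countries = {r[0] for r in cleaned}

print(f"rows: {before_rows} -> {after_rows}")              # 4 -> 2
print(f"countries lost: {before_countries - after_countries}")  # {'FJI'}
```

Note that the row count alone isn't enough: losing two rows from one country is a data gap, but losing a country entirely changes what your average means.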
Aggregation errors and Scope 3 double-counting
When Company A reports Scope 3 emissions from a supplier, and that supplier reports the same emissions as Scope 1, standard aggregation counts them twice. This is a known, unresolved problem in corporate emissions databases. If you have 5 rows with the same country-year combination, AI might average them, drop all but one, or sum them — depending on how it interprets your question. None of those is automatically right. Duplicates in emissions data could mean double-counting; duplicates in project data could mean multiple funding sources for one project.
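Because no resolution is automatically right, the safe first step is to detect and report duplicates without resolving them. A minimal sketch with hypothetical emissions rows:

```python
from collections import Counter

# Hypothetical emissions rows keyed by (country, year). The goal is
# to *detect* duplicates, not to pick sum/average/drop for you.
rows = [
    ("DEU", 2020, 644.0),
    ("DEU", 2020, 644.0),  # possible double-count -- or a second source
    ("FRA", 2020, 277.0),
]

counts = Counter((country, year) for country, year, _ in rows)
duplicates = {key: n for key, n in counts.items() if n > 1}

print(duplicates)  # {('DEU', 2020): 2} -- now decide, with domain knowledge
```

This mirrors the audit-first principle: the code surfaces the ambiguity; you supply the interpretation.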
Definitional drift
IPCC sector categories evolved between AR4 and AR5. Countries that no longer exist or changed names (Soviet republics, South Sudan) create matching problems across time series. Status fields that mean different things ("complete" vs. "closed"). Free-text organization names where the same entity appears as "WWF," "WWF International," "World Wildlife Fund," and "World Wildlife Fund (WWF)." AI cannot resolve these without domain knowledge you bring.
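One practical way to bring that domain knowledge is a hand-maintained alias table that AI can apply but you control. A sketch with illustrative entries:

```python
# A hand-maintained alias table encodes the domain knowledge AI lacks.
# Entries are illustrative; extend the table as new variants turn up.
ALIASES = {
    "WWF": "World Wildlife Fund",
    "WWF International": "World Wildlife Fund",
    "World Wildlife Fund (WWF)": "World Wildlife Fund",
}

def canonical(name: str) -> str:
    """Return the canonical organization name, or the input unchanged."""
    return ALIASES.get(name.strip(), name.strip())

print(canonical("WWF International"))  # World Wildlife Fund
print(canonical("Oxfam"))              # Oxfam -- unmapped names pass through
```

Unmapped names passing through unchanged is deliberate: it keeps the table's misses visible instead of silently inventing matches.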
The right workflow: audit first, clean second
The instinct when you get messy data is to start fixing it. Resist this. Have AI map the problems before it touches anything. Then clean one category at a time, reviewing proposed changes before approving them.
Worked example: OECD climate finance data
Step 1 — Audit
"Before we clean anything: tell me every data quality issue you can see. Look for inconsistent formatting, missing values, apparent duplicates, columns with wrong data types, values that look like outliers. Don't fix anything yet — just report what you find."
This gives you a map of the work. You'll catch things you wouldn't have thought to look for.
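You can also ask AI to generate the audit as code you run yourself, which makes the report reproducible. A minimal sketch over a hypothetical CSV snippet, with made-up column names and an arbitrary outlier threshold:

```python
import csv
import io

# Hypothetical climate-finance CSV; column names and the 1e5 outlier
# threshold are illustrative, not from any real OECD file.
RAW = """country,year,disbursement_musd
Kenya,2021,12.5
Kenya,2021,12.5
Georgia,,30
Developing Countries,2021,1e6
"""

issues = []
seen = set()
for lineno, row in enumerate(csv.DictReader(io.StringIO(RAW)), start=2):
    key = (row["country"], row["year"])
    if key in seen:
        issues.append(f"line {lineno}: duplicate {key}")
    seen.add(key)
    if not row["year"]:
        issues.append(f"line {lineno}: missing year")
    if float(row["disbursement_musd"]) > 1e5:
        issues.append(f"line {lineno}: outlier disbursement")

for issue in issues:
    print(issue)  # report only -- nothing is fixed yet
```

The script only reports; it changes nothing, exactly as the audit prompt demands.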
Step 2 — Standardize with review
"Standardize all country names to ISO 3166 alpha-3 codes. Show me a mapping table — Original Value | Proposed ISO Code — for every unique country name before you make any changes."
Review the mapping. AI sometimes confuses "Georgia" the country (GEO) with the US state. "Korea" is ambiguous — South (KOR) or North (PRK)? Regional aggregates like "Developing Countries" should not be mapped to a country code at all.
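The review itself can be mechanical: flag anything that isn't a clean three-letter code and make a human decide. A sketch with an illustrative mapping, including two deliberate trap cases:

```python
# Illustrative proposed mapping, as AI might return it. The trap
# entries are intentional: one ambiguous name, one regional aggregate.
proposed = {
    "Kenya": "KEN",
    "Georgia": "GEO",              # the country, not the US state
    "Korea": "???",                # ambiguous: KOR or PRK? Ask.
    "Developing Countries": None,  # aggregate -- must not get a code
}

# Anything that is not a clean three-letter code needs a human decision.
needs_review = [orig for orig, code in proposed.items()
                if not (code and len(code) == 3 and code.isalpha())]

for original, code in proposed.items():
    flag = " <-- REVIEW" if original in needs_review else ""
    print(f"{original:22} | {code}{flag}")

print(needs_review)  # ['Korea', 'Developing Countries']
```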
Step 3 — Verify after cleaning
"Give me a cleaning summary: how many rows were modified, dropped, flagged? What was the total disbursement value before and after? How many unique country codes do we have now?"
Check the edges: maximum and minimum values. If your maximum disbursement is larger than a country's GDP, you have a unit problem.
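A verification summary worth asking for covers counts, totals, and extremes together, so a unit error or a silent drop shows up as a number you can interrogate. A sketch with hypothetical values:

```python
# Hypothetical disbursement values; the 1e6 entry is a planted outlier
# that the cleaning step flagged and removed.
before = [12.5, 30.0, 7.2, 1_000_000.0]
after = [12.5, 30.0, 7.2]

summary = {
    "rows_before": len(before),
    "rows_after": len(after),
    "total_before": sum(before),
    "total_after": sum(after),
    "max_after": max(after),
    "min_after": min(after),
}
print(summary)

# Sanity check: the dropped value should account for the whole difference.
assert summary["total_before"] - summary["total_after"] == 1_000_000.0
```

If the totals differ by more than what you knowingly removed, something else changed — and you want to find out what before you publish an average.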
Document every cleaning decision
Every cleaning decision is a methodological choice. When your analysis gets challenged — and if it's any good, it will — you need to answer: where did this data come from? What did you do to it? What choices did you make, and why?
Capture: the source (URL, version, date), what you included and excluded and why, how you handled missing values, what normalizations you applied (currency conversions? inflation adjustments? per-capita calculations? what base year?), and how many rows were affected by each operation.
This is especially important when AI did the cleaning. AI doesn't remember what it did across sessions. You document it so you can explain and defend it later.
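One way to make the record durable is a machine-readable log written alongside the cleaned file. Field names and values below are illustrative (the row counts echo the earlier example); adapt the shape to your own pipeline:

```python
import json
from datetime import date

# Illustrative provenance log -- every field here is a placeholder
# to be filled with your actual source, choices, and counts.
cleaning_log = {
    "source": {
        "url": "https://example.org/climate-finance.csv",
        "version": "2023 release",
        "retrieved": str(date.today()),
    },
    "exclusions": "regional aggregates dropped (not countries)",
    "missing_values": "flagged, not filled; 44 rows affected",
    "normalizations": ["constant 2020 USD", "ISO 3166 alpha-3 codes"],
    "row_counts": {"before": 1800, "after": 1756},
}

with open("cleaning_log.json", "w") as f:
    json.dump(cleaning_log, f, indent=2)
```

A log like this travels with the dataset, so the defense of your methodology doesn't depend on anyone's memory — yours or the AI's.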
Ask AI to generate the cleaning code. You don't need to understand every line — you need to confirm it did what you asked. A cleaning script you can re-run is also a reusable artifact: when someone asks how you handled missing values, you can point to the code.