Terra Studio/Climate Data with AI

Section 5 of 5 · 12 min read

The Data Pipeline

Doing good data work once is a skill. Doing it reliably — in a way that you can explain, replicate, and hand off — is a system. The pipeline connects every stage we've covered into a repeatable process. The Ground Truth Document is what keeps it honest.

Five stages, one throughline

Most data work failures happen at the handoffs between stages — not within any single stage. The analysis is fine, but it was run on data that wasn't fully cleaned. The visualization is honest, but the hero metric it highlights wasn't interrogated for framing assumptions. The pipeline approach forces you to complete each stage before advancing, and to document what you did before you forget.

01

Source

Identify your data source. Evaluate it against the four criteria: methodology documented, currency, political independence, coverage match. Download and version your data — never work from a URL that might change.

Gate check: Can you name the source, its methodology, and its last update date?
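Versioning a download can be as light as recording a checksum and access date next to the file, so you can later prove which snapshot an analysis ran on. A minimal sketch in Python; the function name, file layout, and URL are illustrative, not part of any particular toolchain:

```python
import hashlib
import json
from datetime import date
from pathlib import Path

def version_dataset(path: str, source_url: str) -> dict:
    """Fingerprint a downloaded file so this exact version can be re-identified."""
    data = Path(path).read_bytes()
    record = {
        "source_url": source_url,
        "file": Path(path).name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "date_accessed": date.today().isoformat(),
    }
    # Store the record next to the data file so it travels with it.
    Path(path).with_suffix(".meta.json").write_text(json.dumps(record, indent=2))
    return record
```

If the source later changes the file behind the same URL, the checksum mismatch makes that visible immediately.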

02

Clean

Run the audit before any cleaning. Handle missing values explicitly. Track row counts before and after. Document every decision — what you changed, what you excluded, and why.

Gate check: Do your row counts match? Are your totals plausible? Can you explain every cleaning choice?
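The row-count gate check can be enforced mechanically rather than by eye. A sketch, assuming rows arrive as a list of dicts; the field names are hypothetical:

```python
def clean_with_log(rows, required_fields):
    """Drop rows missing required fields, recording counts and reasons.

    rows: list of dicts. required_fields: fields that must be present and non-empty.
    Returns (cleaned_rows, log), where the log documents every exclusion.
    """
    log = {"rows_before": len(rows), "excluded": []}
    cleaned = []
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            log["excluded"].append({"row": i, "missing": missing})
        else:
            cleaned.append(row)
    log["rows_after"] = len(cleaned)
    # Gate check: before-count must equal after-count plus exclusions.
    assert log["rows_before"] == log["rows_after"] + len(log["excluded"])
    return cleaned, log
```

The log it returns is exactly the material the Cleaning decisions section of a Ground Truth Document needs.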

03

Analyze

Choose your framing consciously: absolute, per-capita, indexed, intensity, cumulative, consumption-based. The right framing depends on your audience and their decision. Run multiple framings and compare. Gut-check outputs against known values.

Gate check: Does this framing serve your audience's actual decision? Does the output survive a sanity check?
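Running several framings side by side takes only a few lines once the series are in hand. A sketch with three of the framings above; the input shapes are illustrative:

```python
def framings(emissions_by_year, population_by_year, base_year):
    """Compute three framings of the same emissions series.

    emissions_by_year: {year: tonnes CO2}. population_by_year: {year: people}.
    Returns {framing_name: {year: value}} so the framings can be compared directly.
    """
    years = sorted(emissions_by_year)
    base = emissions_by_year[base_year]
    return {
        "absolute": {y: emissions_by_year[y] for y in years},
        "per_capita": {y: emissions_by_year[y] / population_by_year[y] for y in years},
        "indexed": {y: 100 * emissions_by_year[y] / base for y in years},  # base year = 100
    }
```

The same data can rise in absolute terms while holding flat per capita, which is precisely the comparison this structure surfaces.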

04

Hero metric

Surface candidates using AI, then evaluate each against four properties: surprising, concrete, relevant to your audience, defensible under challenge. Choose one. Write the sentence: 'The single most important number is ___, which shows ___.'

Gate check: Can you defend this number when challenged? Does it pass the four-property test?

05

Visualize

Choose the chart type that shows your finding most clearly. Check for the four deceptions. Write the annotation that tells your audience what the data means. Test with the lie factor: does the visual impression match the data?

Gate check: Does the visual impression match the actual data difference? Are the deceptions absent?
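The lie factor test can be made numeric with Tufte's formula: the relative change shown in the graphic divided by the relative change in the data, where a value near 1.0 means the visual impression is honest. A sketch:

```python
def lie_factor(data_values, graphic_sizes):
    """Tufte's lie factor: visual change divided by data change.

    data_values: two data points (e.g. the 2010 and 2020 figures).
    graphic_sizes: the corresponding bar heights as actually drawn.
    """
    d0, d1 = data_values
    g0, g1 = graphic_sizes
    data_change = (d1 - d0) / d0
    visual_change = (g1 - g0) / g0
    return visual_change / data_change
```

For example, a 10% rise (100 to 110) drawn on an axis truncated at 95 produces bars of 5 and 15 units, a 200% visual change, for a lie factor of 20.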

The Ground Truth Document

AI doesn't remember what it did. If you run a cleaning session on Monday and come back on Thursday, the AI starts from scratch. The Ground Truth Document is the one place where you capture what actually happened to your data — not what you intended, but what you did.

It has four sections: Data source (URL, version, date accessed, description of what the dataset contains and how it was collected), Cleaning decisions (what you changed, what you excluded, how you handled missing values, row counts before and after), Analysis choices (which framing you used, why, what alternatives you considered), and Hero metric rationale (the metric, its source calculation, and the argument for why it works for your audience).

For a single chart, this is a paragraph. For a major report, it's several pages. The content is always the same — just more of it. The document is also what you hand to a colleague when they take over your analysis, and what you refer to when someone challenges your methodology six months later.
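One way to keep the four sections consistent across projects is a fill-in template that fails loudly when a field is missing. A sketch; the layout follows the four sections above, and every field name is illustrative:

```python
GROUND_TRUTH_TEMPLATE = """\
GROUND TRUTH: {title}

1. Data source
   URL: {url} | Version: {version} | Accessed: {accessed}
   {source_notes}

2. Cleaning decisions
   Rows before: {rows_before} | Rows after: {rows_after}
   {cleaning_notes}

3. Analysis choices
   Framing: {framing}
   {analysis_notes}

4. Hero metric
   Metric: {hero_metric}
   {hero_rationale}
"""

def render_ground_truth(**fields):
    """Fill the template. A missing field raises KeyError instead of passing silently."""
    return GROUND_TRUTH_TEMPLATE.format(**fields)
```

Because `str.format` raises on a missing key, an incomplete document cannot be generated by accident.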

In 2024, Kotz and Wenz published an estimate in Nature that climate change would cost the world $38 trillion annually. The paper later had to be retracted: the global damage figure had been skewed by a single bad data point. The analysis pipeline looked rigorous, yet one unchecked outlier distorted the headline number policymakers were citing worldwide. A Ground Truth Document that captured outlier handling would have surfaced this.
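A generic screen as simple as Tukey's IQR fence catches this class of error before a headline number ships. This is not the method any particular paper used, just a minimal standard-library sketch of the check:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fence).

    Returns the flagged values. Each one should be reviewed and its handling
    documented before it can influence an aggregate figure.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

The fence flags candidates for inspection, not automatic deletion; the decision about each flagged value belongs in the cleaning log.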

When something doesn't add up

Verification failures happen at three levels: the data (wrong source, coverage gaps, cleaning errors), the analysis (wrong framing, aggregation errors, unchecked outliers), and the visualization (deceptive defaults, wrong chart type, missing context).

When a verification check fails, the instinct is to fix the immediate problem. The better move is to ask: at which stage did this enter the pipeline? If a total doesn't match what you expect, go back to the cleaning log. If a trend looks wrong, check whether the start year was intentional. If a visualization seems misleading, check the axis settings before assuming the data is wrong.

The most underused check: cross-validate with a different tool. Upload your cleaned dataset to a different AI and ask: "Does this data look clean? Flag any remaining problems." Different tools have different blind spots. If both find the same issue, it's real. If only one does, investigate before acting.

Exercise

Hero Metric Finder

Describe your dataset and audience. Get a recommended hero metric, two alternatives, and one warning about what not to do.