Jeremy Nelson - Data Engineer in Chicago

Building Data Pipelines That Last: A Pragmatic Approach

The best data engineers I know aren't the ones chasing every new tool. They're the ones who can explain why a 5-year-old pipeline still works and when it's actually worth rebuilding.

This isn't about resisting change. It's about understanding the real cost of complexity.

The Trap of "Modern" Data Stacks

Every few months, there's a new "must-have" tool in data engineering, and the hype cycle is relentless.

Here's what nobody tells you: Most companies don't need real-time. They need reliable. They need understandable. They need maintainable.

A pipeline that runs once per day and has operated flawlessly for three years is infinitely more valuable than a real-time streaming architecture that requires constant babysitting and breaks every time the upstream schema changes.

What Actually Makes Pipelines Last

1. Simplicity Beats Cleverness

The most durable pipelines I've seen are boring. They use standard tools in standard ways. They don't rely on obscure features or complex orchestration logic.

Bad: A 200-line Airflow DAG with nested task groups, custom operators, and dynamic task generation that only the original author understands.

Good: A 20-line DAG that anyone on the team can read and modify in 30 minutes.
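
To make that concrete, here is a rough sketch of what a "boring" DAG can look like, assuming Airflow 2.4+'s TaskFlow API; the DAG id, task names, and bodies are hypothetical placeholders:

    # A deliberately boring daily DAG: extract, transform, load, nothing clever.
    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
    def daily_crm_sync():
        @task
        def extract() -> str:
            ...  # pull yesterday's records, return a path to the raw file

        @task
        def transform(raw_path: str) -> str:
            ...  # clean and reshape, return a path to the processed file

        @task
        def load(clean_path: str) -> None:
            ...  # append to the warehouse table

        load(transform(extract()))

    daily_crm_sync()

Linear dependencies, no custom operators, no dynamic task generation. Anyone on the team can trace it top to bottom.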

Complexity is a liability. Every line of custom code is a line you have to maintain, debug, and explain to the next person.

2. Fail Loud, Fail Fast

Silent failures are the enemy. A pipeline that "succeeds" but produces wrong data is worse than one that fails outright.

Build in assertions:
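
As a minimal sketch, assuming a pandas DataFrame at the end of a load step; the column names and checks are hypothetical:

    import pandas as pd

    def validate(df: pd.DataFrame) -> None:
        # Fail loudly here rather than letting bad rows flow downstream.
        assert len(df) > 0, "load produced zero rows"
        assert df["transaction_id"].is_unique, "duplicate transaction_ids"
        assert df["amount"].notna().all(), "NULL amounts found"
        assert (df["amount"] >= 0).all(), "negative amounts found"

A failed assertion stops the run with a message that says exactly what went wrong, which is the whole point.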

Alert on anomalies, not just errors. If your daily transaction count suddenly drops by 80%, that's a problem even if the pipeline technically completed.
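
A hedged sketch of that kind of check; the send_alert hook and the 80% threshold are hypothetical stand-ins for whatever your team actually uses:

    def check_volume(today_count: int, trailing_counts: list[int]) -> None:
        # Compare today against a trailing baseline, not just against zero.
        baseline = sum(trailing_counts) / len(trailing_counts)
        if today_count < baseline * 0.2:  # i.e., a drop of more than 80%
            send_alert(
                f"Volume anomaly: {today_count} rows today vs ~{baseline:.0f} baseline"
            )

    def send_alert(message: str) -> None:
        # Placeholder: wire this to Slack, PagerDuty, email, whatever wakes someone up.
        print(message)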

3. Document the Why, Not Just the What

Code comments explain what the code does; that's table stakes. The real value is comments that explain why.
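
A couple of illustrations; the constraints described in these comments are hypothetical, but this is the shape of comment that earns its keep:

    # WHY: the vendor API silently drops records created in the last 15
    # minutes, so we deliberately lag the extraction window. Do not "fix"
    # this by extracting up to the current time.
    WINDOW_LAG_MINUTES = 15

    # WHY: customer_id 0 is the vendor's sentinel for "anonymous"; rows with
    # it are filtered at extraction so downstream joins don't fan out.
    ANONYMOUS_CUSTOMER_ID = 0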

Six months from now, someone (possibly you) will need to modify this pipeline. They'll need to know which assumptions can be safely challenged and which are load-bearing.

4. Design for Debugging

Pipelines will fail. When they do, you need to answer three questions quickly:

  1. What failed?
  2. When did it fail?
  3. What was the last known good state?

This means:

  1. Structured, timestamped logs at every stage, so "what failed" and "when" are one search away.
  2. Idempotent runs, so you can safely re-execute once the problem is fixed.
  3. Persisted intermediate outputs and a recorded last-success marker, so the last known good state is a fact, not a reconstruction (see the sketch after this list).
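
A minimal sketch of that last-success marker, assuming a simple file-based state store; the path and fields are hypothetical:

    import datetime
    import json
    import pathlib

    STATE = pathlib.Path("state/last_success.json")

    def mark_success(run_date: datetime.date, rows_loaded: int) -> None:
        # Written only after the run fully succeeds.
        STATE.parent.mkdir(parents=True, exist_ok=True)
        STATE.write_text(json.dumps({
            "run_date": run_date.isoformat(),
            "rows_loaded": rows_loaded,
            "finished_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }))

    def last_known_good() -> dict | None:
        # The first thing on-call checks when the pipeline breaks.
        return json.loads(STATE.read_text()) if STATE.exists() else None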

5. Know When to Rebuild

Sometimes pipelines do need to be replaced. The key is having criteria for that decision:

Rebuild when:

  1. The cost of maintaining the pipeline clearly exceeds the cost of replacing it.
  2. The requirements have genuinely changed, e.g. daily batch truly no longer serves the business.
  3. The underlying tools are unsupported and pose a real operational risk.

Don't rebuild when:

  1. The pipeline is merely old, ugly, or unfashionable.
  2. The main motivation is wanting to use a newer tool.
  3. Nobody can articulate, in concrete terms, what the rebuild would fix.

The 5-Year Test

Before adding complexity to a pipeline, ask yourself: "Will I be able to explain this to a new hire in 5 years?"

If the answer is no, simplify. If you can't simplify, document extensively. If you can't document it clearly, reconsider the approach.

Real-World Example

I inherited a pipeline that moved data from a legacy CRM to a data warehouse. By modern standards, it was "ugly".

The team wanted to rebuild it with modern tooling: Kafka streams, microservices, the works.

We didn't. Instead, we kept the existing design and made a handful of small, targeted improvements.

Two years later, it's still running. Total downtime: zero. Total engineering time spent: minimal.

The "modern" approach would have taken months to build, required ongoing maintenance, and introduced multiple new failure modes.

The Bottom Line

Reliable systems are built on:

  1. Simplicity over cleverness.
  2. Failures that are loud and fast.
  3. Documentation that explains the why.
  4. Designs you can debug.
  5. Clear criteria for when to rebuild.

Not on the latest tools.

Flashy demos are easy. Reliable systems are hard. Choose hard.


What data engineering practices have stood the test of time for you? I'd love to hear your experiences.