Building Data Pipelines That Last: A Pragmatic Approach
The best data engineers I know aren't the ones chasing every new tool. They're the ones who can explain why a 5-year-old pipeline still works and when it's actually worth rebuilding.
This isn't about resisting change. It's about understanding the real cost of complexity.
The Trap of "Modern" Data Stacks
Every few months, there's a new "must-have" tool in data engineering. The hype cycle is relentless:
- "You need real-time streaming!"
- "Batch processing is dead!"
- "Migrate everything to [insert latest framework]!"
Here's what nobody tells you: Most companies don't need real-time. They need reliable. They need understandable. They need maintainable.
A pipeline that runs once per day and has operated flawlessly for three years is far more valuable than a real-time streaming architecture that requires constant babysitting and breaks every time the upstream schema changes.
What Actually Makes Pipelines Last
1. Simplicity Beats Cleverness
The most durable pipelines I've seen are boring. They use standard tools in standard ways. They don't rely on obscure features or complex orchestration logic.
Bad: A 200-line Airflow DAG with nested task groups, custom operators, and dynamic task generation that only the original author understands.
Good: A 20-line DAG that anyone on the team can read and modify in 30 minutes.
Complexity is a liability. Every line of custom code is a line you have to maintain, debug, and explain to the next person.
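To make the contrast concrete, here's a minimal sketch of the "boring" style: three plain functions and one entry point, no framework magic. The table layout and function bodies are illustrative stubs, not a real pipeline.

```python
from datetime import date

# A deliberately boring daily pipeline: extract, transform, load.
# Anyone on the team can read and modify this in minutes.

def extract(run_date: date) -> list[dict]:
    """Pull one day of rows from the source (stubbed for illustration)."""
    return [{"order_id": 1, "amount": 25.0, "day": run_date.isoformat()}]

def transform(rows: list[dict]) -> list[dict]:
    """Keep only positive-amount orders. No clever tricks."""
    return [r for r in rows if r["amount"] > 0]

def load(rows: list[dict]) -> int:
    """Write to the warehouse (stubbed); return the number of rows written."""
    return len(rows)

def run(run_date: date) -> int:
    """The whole pipeline in one line anyone can follow."""
    return load(transform(extract(run_date)))
```

The same three steps wired through an orchestrator stay just as readable; the point is that each stage is a standalone function you can test and rerun by hand.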
2. Fail Loud, Fail Fast
Silent failures are the enemy. A pipeline that "succeeds" but produces wrong data is worse than one that fails outright.
Build in assertions:
- Row count checks ("Did we get roughly the expected volume?")
- Data quality checks ("Are null rates within normal ranges?")
- Business logic validation ("Does revenue match our reconciliation reports?")
Alert on anomalies, not just errors. If your daily transaction count suddenly drops by 80%, that's a problem even if the pipeline technically completed.
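The three checks above can be sketched as plain assertions. Thresholds, column names, and the 80%-drop guard are illustrative assumptions; tune them to your data.

```python
# Fail loud, fail fast: raise on bad data instead of silently succeeding.

def check_row_count(n_rows: int, expected: int, tolerance: float = 0.2) -> None:
    """Did we get roughly the expected volume? (tolerance is a made-up 20%)."""
    if abs(n_rows - expected) > expected * tolerance:
        raise ValueError(
            f"Row count {n_rows} outside {tolerance:.0%} of expected {expected}"
        )

def check_null_rate(rows: list[dict], column: str, max_rate: float = 0.05) -> None:
    """Are null rates within normal ranges for this column?"""
    nulls = sum(1 for r in rows if r.get(column) is None)
    rate = nulls / len(rows) if rows else 0.0
    if rate > max_rate:
        raise ValueError(
            f"Null rate {rate:.1%} in '{column}' exceeds {max_rate:.0%}"
        )

def check_volume_drop(today: int, yesterday: int, max_drop: float = 0.8) -> None:
    """A pipeline can 'succeed' and still be wrong: alert on large drops."""
    if yesterday and today < yesterday * (1 - max_drop):
        raise ValueError(
            f"Daily count dropped from {yesterday} to {today}; investigate"
        )
```

Wiring these into the end of each run turns the anomaly from a silent data problem into a visible pipeline failure.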
3. Document the Why, Not Just the What
Code comments explain what the code does. That's table stakes. The real value is explaining why:
- Why this join is necessary
- Why this filter exists
- Why this business rule is applied
Six months from now, someone (possibly you) will need to modify this pipeline. They'll need to know which assumptions can be safely challenged and which are load-bearing.
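Here's what "why, not what" comments might look like in practice. The business rule and field names are hypothetical, invented purely to illustrate the style:

```python
def net_revenue(orders: list[dict]) -> float:
    # WHY: refunds issued within 24h never settle with the payment
    # processor, so finance excludes them from revenue. This mirrors
    # their reconciliation report; don't "fix" this without asking them.
    settled = [o for o in orders if not o.get("refunded_within_24h", False)]
    # WHY: amounts arrive in cents upstream; the warehouse stores dollars.
    return sum(o["amount_cents"] for o in settled) / 100
```

The "what" (a filter and a sum) is obvious from the code; the comments record which assumptions are load-bearing.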
4. Design for Debugging
Pipelines will fail. When they do, you need to answer three questions quickly:
- What failed?
- When did it fail?
- What was the last known good state?
This means:
- Granular logging (not just "step 3 started" but "processing partition 2024-01-15, 1.2M rows")
- Clear error messages (not just "NullPointerException" but "Expected column 'customer_id' missing from source table")
- Easy reruns (can you replay just the failed partition without rebuilding everything?)
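All three properties can live in one small function. This is a sketch with made-up column names: granular logging, an error message that names the missing column, and a per-partition entry point you can rerun in isolation.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

REQUIRED_COLUMNS = {"customer_id", "amount"}  # illustrative schema

def process_partition(partition: str, rows: list[dict]) -> int:
    """Process one date partition; replayable without rebuilding everything."""
    # Granular logging: which partition, how many rows.
    log.info("processing partition %s, %d rows", partition, len(rows))
    if rows:
        missing = REQUIRED_COLUMNS - rows[0].keys()
        if missing:
            # Clear error: name the columns, not just a bare traceback.
            raise KeyError(
                f"Expected column(s) {sorted(missing)} missing from source rows"
            )
    log.info("partition %s done, %d rows processed", partition, len(rows))
    return len(rows)
```

Because the function takes the partition as an argument, replaying the failed day is a one-line call rather than a full rebuild.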
5. Know When to Rebuild
Sometimes pipelines do need to be replaced. The key is having criteria for that decision:
Rebuild when:
- The business logic has fundamentally changed
- The underlying data sources have shifted
- The current solution can't scale to meet new requirements
- Maintenance cost exceeds replacement cost
Don't rebuild when:
- There's a new shiny tool
- The code is "messy" but functional
- Someone just prefers a different approach
The 5-Year Test
Before adding complexity to a pipeline, ask yourself: "Will I be able to explain this to a new hire in 5 years?"
If the answer is no, simplify. If you can't simplify, document extensively. If you can't document it clearly, reconsider the approach.
Real-World Example
I inherited a pipeline that moved data from a legacy CRM to a data warehouse. It was "ugly":
- Used an old Python 2 script
- Had hardcoded business rules
- Ran on a cron job on a single EC2 instance
The team wanted to rebuild it with modern tooling: Kafka streams, microservices, the works.
We didn't. Instead, we:
- Added monitoring and alerting
- Documented the business rules
- Set up automated testing
- Created a runbook for common issues
Two years later, it's still running. Total downtime: zero. Total engineering time spent: minimal.
The "modern" approach would have taken months to build, required ongoing maintenance, and introduced multiple new failure modes.
The Bottom Line
Reliable systems are built on:
- Clear requirements
- Simple implementations
- Good observability
- Thoughtful documentation
Not on the latest tools.
Flashy demos are easy. Reliable systems are hard. Choose hard.
What data engineering practices have stood the test of time for you? I'd love to hear your experiences.