Less Haste, More Speed
Data-focused teams can seem strange to the rest of the business - especially those managing big data sets. Why do they take so much longer to deliver features or bug fixes than the web team? Why does transforming historical data take weeks, when major changes to a user journey can take half the time?
A harsh environment
In my experience, data engineering is less forgiving than many other parts of modern development. The margin for error is smaller because - over enough data - the impact of mistakes on a task's delivery time is exaggerated.
Bugs that make it to production are often worse.
In many cases, a mobile app team can release a bug fix and be happy that upgrading customers can access the feature as intended. If your ingest pipeline has a bug that leads to dubious data, however, you need both a fix going forward and a way to repair the data that's already arrived.
There are lots of ways to avoid these issues - today I want to focus on just one practice.
The discipline to focus on a subset
When faced with a data set too large to work on practically, most data engineers will recognise that you need a subset. Maybe you pick one week, or maybe a sample of 1/10,000 would be more representative.
Against that subset, you can try out your transformation or query; building it incrementally & questioning your intermediate results each step of the way. This gives you a (fairly) high-quality, tight feedback loop.
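As a sketch of that loop - using pandas and an invented `events` table as a stand-in for real production data - carving out a subset might look like:

```python
import pandas as pd

# Hypothetical event data standing in for a much larger production table.
events = pd.DataFrame({
    "user_id": range(100_000),
    "amount": [i % 50 for i in range(100_000)],
})

# Option 1: a fixed window (here, the first rows stand in for "one week").
window = events.head(7_000)

# Option 2: a deterministic random sample - fixing the seed keeps the
# subset reproducible between runs, so results stay comparable as you
# iterate on the transformation.
sample = events.sample(frac=1 / 10_000, random_state=42)

print(len(window), len(sample))
```

Whether a time window or a random sample is more representative depends on the data: a window preserves ordering and sessions, while a random sample better reflects the overall distribution.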
The place where junior (and some more experienced) engineers tend to stray is in moving on from that subset too early. If your work isn't trivially straightforward, there are always edge cases to eliminate, and foundational assumptions you might not have considered.
Have you accounted for:

- Null values (is the field nullable at all)?
- If you're dealing with time-series data, are all of the timestamp fields definitely in UTC? Are the timestamps provided by the server, or are clients providing them? How could they be off?
- Is your text substring search on that mass extract of log data case-insensitive? How can you test that?
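As a hedged illustration (pandas again, with an invented `subset` table standing in for real ingest data), each of those questions can become a cheap check you run against the subset before scaling up:

```python
import pandas as pd

# Hypothetical subset of an ingest table - stand-ins for real data.
subset = pd.DataFrame({
    "user_id": [1, 2, None],
    "created_at": pd.to_datetime(
        ["2024-01-01T09:00:00+00:00", "2024-01-01T10:30:00+01:00", None],
        utc=True,  # normalise mixed offsets to UTC up front
    ),
    "message": ["Fatal ERROR in worker", "error: disk full", "all good"],
})

# 1. Nulls: is each field nullable, and how many rows are affected?
null_counts = subset.isna().sum()

# 2. Timestamps: utc=True above coerces everything to UTC - verify it.
assert str(subset["created_at"].dt.tz) == "UTC"

# 3. Substring search: case=False makes the match case-insensitive, so
#    "ERROR" and "error" are both caught; na=False treats nulls as no-match.
errors = subset["message"].str.contains("error", case=False, na=False)

print(null_counts["user_id"], errors.sum())
```

Checks like these are cheap on a subset and catch exactly the class of bug that is expensive to discover after the full run.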
These questions should be answered before you expand to the larger dataset. Less haste, more speed.
… in their hearts…
Most data engineers know this, but many - especially when mentally exhausted or put under pressure - will choose to skip ahead early. Especially if an update feels minor.
Is that you? Or are you rushing your team for a short-term result? Is your team overloaded with tasks & context switches that damage their focus?
If you find tasks dragging out because of botched attempts on full data sets, those questions are worth reflecting on.
Got thoughts or questions? I'm here to help at email@example.com