Book review: “Data Pipelines Pocket Reference”, James Densmore
Software engineering is all about manipulating data. A big portion of a software engineer’s attention goes to collecting data from users and presenting it back to them in a useful form. However, there is another side of data: the kind that software users consume but do not produce. Here, the goals are a single source of truth, data validity and availability, and performant processing (for analysis or presentation).
To get a better grasp of the tooling for working with such data, this week I read Data Pipelines Pocket Reference by James Densmore. The book focuses on the modern ELT (Extract-Load-Transform) approach, as well as EtLT (with a lowercase ‘t’ for a generic, non-business-related transformation step).
It turned out to be a very practical pocket guide indeed. The sections on data extraction, loading, and transformation all come with clear code snippets in Python. The snippets show how to connect to essential services (databases, AWS S3, Amazon Redshift, Snowflake, Apache Airflow) and cover the basics of data manipulation.
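To give a flavor of those snippets, here is a sketch of an extract step in the book’s spirit (my own code, not the book’s): run a query against a source database and dump the result set to a flat file for loading. An in-memory SQLite table stands in for the source system, where the book would connect to MySQL or Postgres, and the table name is hypothetical.

```python
import csv
import sqlite3

def extract_to_csv(conn, query, out_path):
    """Run a query against the source database and write the
    result set, header row included, to a CSV file for loading."""
    cur = conn.execute(query)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur.fetchall())

# An in-memory table stands in for a real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
extract_to_csv(conn, "SELECT id, amount FROM orders", "orders.csv")
```

In the book’s pipelines, the resulting file would then be uploaded to S3 and bulk-loaded into a warehouse such as Redshift or Snowflake.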
I liked two things. First, these snippets feel production-ready. Granted, they feature no robust error handling, but they are sufficient to start moving data around, running validations, and applying transformations. Second, the author focuses not only on interacting with services but also shares tricks for data processing and validation. In particular, there is a neat data testing framework based on a separate Python script for each check, which can be integrated into Airflow workflows. The approach, while quite lean, requires a certain mindset to arrive at, so this bit of knowledge alone saves time and lays a scalable foundation for data processing.
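A sketch of that idea (my own, not the book’s code, with a hypothetical table name): each check lives in its own script and signals failure through its exit code, so an orchestrator task such as Airflow’s BashOperator fails when the check does. In-memory SQLite databases stand in for the source system and the warehouse.

```python
import sqlite3

def check_row_counts(src_conn, dst_conn, table):
    """One standalone validation check: do source and destination
    hold the same number of rows after the load step?"""
    count_query = f"SELECT COUNT(*) FROM {table}"  # table name is trusted here
    src_count = src_conn.execute(count_query).fetchone()[0]
    dst_count = dst_conn.execute(count_query).fetchone()[0]
    return src_count == dst_count

# In-memory databases stand in for the source system and the warehouse.
src, dst = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE orders (id INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])

ok = check_row_counts(src, dst, "orders")
# A real check script would end with sys.exit(0 if ok else 1),
# letting an Airflow BashOperator mark the task failed on mismatch.
```

Scaling this pattern is just a matter of adding one small script per invariant you care about and wiring each into the DAG.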
That said, I think this book lacks examples that are closer to real practice. It would benefit from a companion GitHub repository with a substantial dataset to run ELT against, in addition to the primitive data samples from the book, which take no more than 10 rows and 5 columns in a single SQL table. The book also skips any in-depth discussion, making it a pocket reference, indeed.
What it covers:
- Data roles: data engineering, data analytics, and data science
- Types of pipelines: ETL vs. ELT vs. EtLT
- Overview of tools for each ELT step and their orchestration
- Minimal instructions for setting up data ingestion and transformation
- Approaches to pipeline orchestration
- A framework for data validation
- Building pipelines with monitoring and maintenance in mind
Verdict: 4 / 5. A good reference for building simple ELT pipelines within a day, which is likely exactly what a general software engineer wants when data engineering is not their primary area of specialization.