<?xml version="1.0" encoding="utf-8"?> 
<rss version="2.0"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
  xmlns:atom="http://www.w3.org/2005/Atom">

<channel>

<title>Blog — George Mishurovsky: posts tagged book review</title>
<link>https://mishurovsky.com/blog/?go=tags/book-review/</link>
<description>A blog by George Mishurovsky — a senior software engineer with a medical degree. Drawing from both engineering and scientific thinking, he explores software, architecture, design, psychology, and product thinking.</description>
<author></author>
<language>en</language>
<generator>Aegea 11.3 (v4134e)</generator>

<itunes:owner>
<itunes:name></itunes:name>
<itunes:email>george@mishurovsky.com</itunes:email>
</itunes:owner>
<itunes:subtitle>A blog by George Mishurovsky — a senior software engineer with a medical degree. Drawing from both engineering and scientific thinking, he explores software, architecture, design, psychology, and product thinking.</itunes:subtitle>
<itunes:image href="https://mishurovsky.com/blog/pictures/userpic/userpic-square@2x.jpg?1753619610" />
<itunes:explicit>no</itunes:explicit>

<item>
<title>Book review: “Data Pipelines Pocket Reference”, James Densmore</title>
<guid isPermaLink="false">35</guid>
<link>https://mishurovsky.com/blog/?go=all/book-review-data-pipelines-pocket-reference-james-densmore/</link>
<pubDate>Wed, 28 Jan 2026 12:21:58 +0200</pubDate>
<author></author>
<comments>https://mishurovsky.com/blog/?go=all/book-review-data-pipelines-pocket-reference-james-densmore/</comments>
<description>
&lt;div class="e2-text-picture"&gt;
&lt;a href="https://www.oreilly.com/library/view/data-pipelines-pocket/9781492087823/" class="e2-text-picture-link"&gt;
&lt;img src="https://mishurovsky.com/blog/pictures/data-pipelines-pocket-reference@2x.png" width="542" height="891" alt="" /&gt;
&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;Software engineering is all about manipulating data. Much of a software engineer’s attention goes to collecting data from users and presenting it back to them in a useful form. However, there is another kind of data: the kind that software users only consume rather than produce. Here, the goals are a single source of truth, data validity and availability, and performant processing (for analysis or presentation).&lt;/p&gt;
&lt;p&gt;To get a better grasp of the tooling for working with such data, this week I read &lt;i&gt;Data Pipelines Pocket Reference&lt;/i&gt; by James Densmore. The book focuses on the modern ELT (Extract-Load-Transform) approach, as well as EtLT (where the lowercase ‘t’ stands for a generic, non-business-related data transformation step).&lt;/p&gt;
&lt;p&gt;It turned out to be a very practical pocket guide indeed. Each of the book’s sections on data extraction, loading, and transformation comes with clear Python code snippets. The snippets demonstrate how to connect to essential services such as databases, AWS S3, Amazon Redshift, Snowflake, and Apache Airflow, as well as the basics of data manipulation.&lt;/p&gt;
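&lt;p&gt;Not an excerpt from the book, but a rough sketch of the extract-load-transform flow those snippets follow, using only Python’s standard library, with SQLite standing in for both the source system and the warehouse (all table and column names here are made up):&lt;/p&gt;

```python
import csv
import os
import sqlite3
import tempfile

# Source system stand-in: an operational database with raw orders.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 10.0, "US"), (2, 25.5, "DE"), (3, 7.25, "US")])

# Extract: dump the source table to a flat CSV file.
fd, csv_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    writer = csv.writer(f)
    for row in source.execute("SELECT id, amount, country FROM orders"):
        writer.writerow(row)

# Load: ingest the CSV into the warehouse as-is, no business logic yet.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders_raw (id INTEGER, amount REAL, country TEXT)")
with open(csv_path, newline="") as f:
    rows = [(int(i), float(a), c) for i, a, c in csv.reader(f)]
warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)

# Transform: build the reporting model inside the warehouse with SQL.
warehouse.execute(
    "CREATE TABLE revenue_by_country AS "
    "SELECT country, SUM(amount) AS revenue "
    "FROM orders_raw GROUP BY country"
)
result = dict(warehouse.execute("SELECT country, revenue FROM revenue_by_country"))
os.remove(csv_path)
```

&lt;p&gt;A real pipeline would extract into S3 rather than a temp file and load with the warehouse’s bulk-ingest command, but the shape of the steps stays the same.&lt;/p&gt;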
&lt;p&gt;I liked two things. First, the snippets feel production-ready. They include no robust error handling, of course, but they are sufficient to start moving data around, running validations, and applying transformations. Second, the author focuses not only on interaction with services but also shares techniques for data processing and validation. In particular, there is a neat data testing framework based on a separate Python script for each check, which can be integrated into Airflow workflows. The approach, while quite lean, requires a certain mindset to arrive at, so this bit of knowledge alone saves time and lays a scalable foundation for data processing.&lt;/p&gt;
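&lt;p&gt;A sketch of that one-script-per-check idea, simplified from memory rather than copied from the book: a check compares the results of two scalar SQL queries, and a real check script would exit nonzero on failure so that the orchestrator task (for example, an Airflow BashOperator running the script) fails the run. Table names below are hypothetical:&lt;/p&gt;

```python
import sqlite3

def run_check(conn, query1, query2, operator):
    """Run two scalar SQL queries and compare the results.

    Returns True when the check passes. In the one-script-per-check
    pattern, each check is a standalone script that signals failure
    with a nonzero exit code (sys.exit(1)), which the orchestrator
    turns into a failed task.
    """
    v1 = conn.execute(query1).fetchone()[0]
    v2 = conn.execute(query2).fetchone()[0]
    if operator == "equals":
        return v1 == v2
    if operator == "greater_equals":
        return v1 >= v2
    raise ValueError("unknown operator: " + operator)

# Demo against an in-memory database; a real check would connect to
# the warehouse. Rule: staging must not have fewer rows than raw.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_raw (id INTEGER)")
conn.execute("CREATE TABLE orders_staging (id INTEGER)")
conn.executemany("INSERT INTO orders_raw VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO orders_staging VALUES (?)", [(1,), (2,), (3,)])
passed = run_check(
    conn,
    "SELECT COUNT(*) FROM orders_staging",
    "SELECT COUNT(*) FROM orders_raw",
    "greater_equals",
)
```

&lt;p&gt;Because each check is its own script, adding a new validation means adding one file and one task, which is what makes the framework scale.&lt;/p&gt;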
&lt;p&gt;That said, I think the book lacks examples that are closer to real practice. It would benefit from a companion GitHub repository with a substantial dataset to run ELT against, in addition to the book’s primitive data samples, which take up no more than 10 rows and 5 columns in a single SQL table. The book also avoids any in-depth discussion, which makes it a pocket reference indeed.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;What it covers:&lt;/b&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data roles: data engineering, data analytics, and data science&lt;/li&gt;
&lt;li&gt;Types of pipelines: ETL vs. ELT vs. EtLT&lt;/li&gt;
&lt;li&gt;Overview of tools for each ELT step and their orchestration&lt;/li&gt;
&lt;li&gt;Minimal instructions for setting up data ingestion and transformation&lt;/li&gt;
&lt;li&gt;Approaches to pipeline orchestration&lt;/li&gt;
&lt;li&gt;A framework for data validation&lt;/li&gt;
&lt;li&gt;Building pipelines with monitoring and maintenance in mind&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;b&gt;Verdict: 4 / 5 — a good reference to start building simple ELT pipelines in a day, which is likely exactly what a general software engineer would want if data engineering is not their primary area of specialization&lt;/b&gt;&lt;/p&gt;
</description>
</item>

<item>
<title>Book review: “Web Scraping with Python”, Ryan Mitchell</title>
<guid isPermaLink="false">32</guid>
<link>https://mishurovsky.com/blog/?go=all/book-review-web-scraping-with-python-ryan-mitchell/</link>
<pubDate>Thu, 22 Jan 2026 16:56:17 +0200</pubDate>
<author></author>
<comments>https://mishurovsky.com/blog/?go=all/book-review-web-scraping-with-python-ryan-mitchell/</comments>
<description>
&lt;div class="e2-text-picture"&gt;
&lt;a href="https://www.oreilly.com/library/view/web-scraping-with/9781098145347/" class="e2-text-picture-link"&gt;
&lt;img src="https://mishurovsky.com/blog/pictures/web-scraping-with-python@2x.jpg" width="381" height="500" alt="" /&gt;
&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;Recently I faced a challenge of designing a web crawling and scraping system. To build context, I started my work by reading &lt;i&gt;Web Scraping with Python: Data Extraction for the Modern Web&lt;/i&gt; by Ryan Mitchell (3rd edition, revised in 2024).&lt;/p&gt;
&lt;p&gt;The book turned out to be a very pleasant read. The author’s approach is well structured, with chapters progressing from simple practical tasks and a legal overview to advanced considerations such as Natural Language Processing and race conditions in distributed scraping systems. Code snippets are concise and useful — I can imagine them being used in a small-scale production system. In two days, I managed to build a solid understanding of common approaches, architectures, problems, and solutions in this field.&lt;/p&gt;
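&lt;p&gt;Not from the book, but a sketch of the kind of crawling strategy it covers: a breadth-first frontier with a visited set and URL normalization (relative links resolved, fragments dropped, external domains skipped), using only the standard library. The link-fetching function is injected so the sketch runs without network access:&lt;/p&gt;

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

def crawl_order(seed, get_links, limit=100):
    """Breadth-first crawl plan: returns URLs in the order they would
    be fetched. get_links(url) must return the raw hrefs found on a
    page; injecting it keeps the sketch free of network access.
    """
    seed_host = urlparse(seed).netloc
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        if len(order) >= limit:
            break
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop fragments so different
            # spellings of a page map to one canonical URL.
            absolute = urldefrag(urljoin(url, href))[0]
            # Stay on the seed's domain; external links are skipped.
            if urlparse(absolute).netloc == seed_host and absolute not in visited:
                frontier.append(absolute)
    return order

# Fake site: a mapping from URL to the hrefs found on that page.
site = {
    "https://example.com/": ["/a", "/b", "https://other.com/x"],
    "https://example.com/a": ["/b#section", "/"],
    "https://example.com/b": [],
}
plan = crawl_order("https://example.com/", lambda u: site.get(u, []))
```

&lt;p&gt;In a real crawler, get_links would fetch the page (for example, with urllib.request) and parse the hrefs out of the HTML.&lt;/p&gt;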
&lt;p&gt;That said, the content is not without its flaws. The section on JavaScript and SSR is so outdated it is almost hilarious. Mentions of Dynamic HTML, jQuery, and AJAX calls would suit a book written around 2010, but not a revision from 2024. Nonetheless, even this section is useful at a conceptual level: modern SPAs achieve the same goals as early-2000s web applications that generated dynamic HTML server-side and sent it to browsers.&lt;/p&gt;
&lt;p&gt;The ease with which I read this book was strongly influenced by my existing knowledge of the web. Over the years, I have built a solid foundation in HTML and CSS, JavaScript and Python, APIs, application architecture, and networking — all of which helped me clearly see the connections between the author’s ideas. However, the book should still be accessible to any technical reader, thanks to its clear explanations and practical code examples.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;What it covers:&lt;/b&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Principles of web technologies&lt;/li&gt;
&lt;li&gt;Legal and ethical considerations&lt;/li&gt;
&lt;li&gt;Common scraping use cases&lt;/li&gt;
&lt;li&gt;Building web crawlers and scrapers&lt;/li&gt;
&lt;li&gt;Crawling strategies&lt;/li&gt;
&lt;li&gt;Transformation and validation of collected data&lt;/li&gt;
&lt;li&gt;Parsing text and image documents&lt;/li&gt;
&lt;li&gt;Scraping traps&lt;/li&gt;
&lt;li&gt;Distributed scraping systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;b&gt;Verdict: 4.5 / 5 — a go-to practical guide for those planning to build their own scraping system.&lt;/b&gt;&lt;/p&gt;
</description>
</item>


</channel>
</rss>