A 2020 Reader's Guide to The Data Warehouse Toolkit
While doing research for The Analytics Setup Guidebook, I read Ralph Kimball and Margy Ross’s legendary book The Data Warehouse Toolkit cover to cover — one of the most difficult things I’d done this year.
If it wasn’t already clear from my last few blog posts, I have great respect for Kimball's ideas, and all the man has done for our industry. But it’s difficult to ignore the fact that TDWT is aging, and badly — there are countless sections of the book where Kimball and Ross dive into database optimisation minutiae — thousands of words that are of little relevance today, given the rise of columnar data warehouses.
So here’s a reader’s guide to reading The Data Warehouse Toolkit in 2020. If you’re a data professional and you want to read the book … perhaps because you want the bragging rights to say that you actually finished the classic that introduced data modeling to the world — well, you’ve come to the right place.
Here’s everything I wished I knew before I started reading The Data Warehouse Toolkit.
1. Don’t Read Everything
The first and perhaps most obvious pointer is the fact that you shouldn’t read the book cover-to-cover like I did.
I’ve just told you that many parts of the The Data Warehouse Toolkit have aged badly. The obvious implication here is that you shouldn’t aim to read everything — you should only read the parts that are timeless! The way I see it, Kimball’s book is really two books pretending to be one: the first contains a set of timeless principles that are likely to be just as relevant today as when he first wrote them in 1996, and the second an implementation book that is tied up with the relational database management systems (RDBMSes) and OLAP cube systems of his day.
You’ll want to read the first book, and completely ignore the second.
(And, yes, you still get full bragging rights about finishing the book, even if you skip over the parts that aren’t as relevant today — this is how book reading works!)
2. Get The Third Edition
Early on in my reading, I found a used copy of the second edition and decided to make do with that.
This was a terrible move.
The biggest problem with The Data Warehouse Toolkit is that it is organised in terms of use cases, not ideas. What this means is that the first substantive chapter in the book is about retail sales, the next one about inventory, the next on procurement, and so on, each chapter examining a new business activity, and introducing one or more new data modeling ideas within the context of the business problem to be solved.
The authors likely thought that this was a comprehensive, lively format, since any new idea was presented within the appropriate real world context. But instead, the format makes the book that much harder to read, since each chapter has to introduce a new business domain, along with a few thousand words of exposition just to set up the problems and challenges that are unique to each use case. This means that the data modeling ideas themselves are often buried deep in the chapter. This made it all incredibly tiring to read.
In the years since The Data Warehouse Toolkit was published, other data modeling practitioners have worked to improve on the original by publishing shorter, more concise books. These newer, shorter books did the more common thing of listing the ideas chapter by chapter, and included case studies and examples only after introducing the data modeling idea.
Kimball and Ross noticed this and finally gave in: in their third edition, published in 2013, they added a new chapter (Chapter 2, “Kimball Dimensional Modeling Techniques Overview”), which served as an index for all the ideas that were spread all throughout the text. This chapter is why you should read the third edition, and the third edition only. Use Chapter 2 as a guide to the rest of the book. Jump around to familiarise yourself with Kimball’s techniques. And only read a chapter from beginning to end if it is a business domain you care about.
3. When Using the Chapter 2 Index, Focus on Timeless Techniques
Chapter 2 is organised as a list of ideas and techniques. It looks something like this:
In this author’s opinion, the following ideas are relevant if you’re doing a data analytics career in 2020. Everything else is gravy.
- You’ll want to start with Kimball’s four-step dimensional modeling design process. It’s probably a good idea to go through everything in the ‘Fundamental Concepts’ section — while keeping in mind that modern data analytics does not commonly perform an up-front data modeling effort, with all the business stakeholders present. Due to the nature of modern BI technology, it’s far simpler to redo data models, so the best practice is to run with lightly transformed data, and only remodel in a Kimball manner later when usage dictates. (For more detail on this, check out Chapter 3 of our free guide to data analytics, here).
- Basic fact table techniques — In particular, you should pay attention to the three kinds of fact tables (summarised here), along with a fourth, rarely used kind.
- Basic dimension table techniques — You should pay attention to degenerate dimensions, multiple hierarchies, and the relatively short discussion on surrogate, natural, durable, and supernatural keys.
- Integration via Conformed Dimensions — You want to pay special attention to the Enterprise Data Warehouse Bus Architecture and Matrix along with the Opportunity/Stakeholder Matrix. Note that this is mostly due to historical reasons — this architecture is not recommended today, given the power of cloud-based data warehouses. But it’s still useful to read up on this, because it will give you a deeper appreciation for the ‘covered ground’ in our field.
- Dealing with Slowly Changing Dimension Attributes — read all of this. It’s still relevant. Then, for a modern update, check out Slowly Changing Dimensions in The Age of The Cloud Data Warehouse.
- Dealing with Dimension Hierarchies — Go through everything in here as well; this mostly lets you know that certain business scenarios might require you to deal with weird dimensional hierarchies (e.g. an employee table where peers can also be managers of each other, within the context of specific projects), and the set of ideas presented here will expand your toolbox.
- Advanced Fact Table Techniques — You’ll want to skim through Multiple Currency Facts and Timespan Tracking in Fact Tables, depending on your business domain, but more importantly you want to read Late Arriving Facts, since this is a commonly occurring pattern.
- Advanced Dimension Techniques — You’ll want to read Audit Dimensions — because this has to do with data quality, and it’s been adapted quite successfully to modern analytics (read more here); you’ll also want to read up on Late Arriving Dimensions.
- Special Purpose Schemas — The most generally useful idea here is Error Event Schemas, which is a nice addition to audit dimensions. Again, this is a data quality thing, and as Uber has demonstrated, still quite applicable to the contemporary practice of analytics.
That’s it! If you enjoyed this blog post, do consider sharing it with a friend. I wish you the best of luck as you attempt to read The Data Warehouse Toolkit.