A modern analytics stack
In this chapter, we will talk about the most common setup for an analytics stack. Granted, you may see other data practitioners doing certain parts of this setup differently, but if you take a step back and squint, nearly all data analytics systems boil down to the same basic approach.
Let’s get started.
In the previous section on minimum viable analytics, we mentioned that all analytical systems must do three basic things. We shall take that idea and elaborate further:
- You must collect, consolidate and store data in a central data warehouse.
- You must process data: that is, transform, clean up, aggregate and model the data that has been pushed to a central data warehouse.
- And you must present data: visualize, extract, or push data to the different services and users that need it.
This book is organized around these three steps. We shall examine each step in turn.
Step 1: Collecting, Consolidating and Storing Data
Before you can analyze your organization’s data, raw data from multiple sources must be pulled into a central location in your analytics stack.
In the past, this may have been a ‘staging area’ — that is, a random server where everyone dumped their data. A couple of years ago, someone had the bright idea of calling this disorganized staging area a ‘data lake’. We believe that the idea is more important than the name (and we also believe that a dump by any other name would smell just as sweet), and therefore encourage you to just think of this as a ‘centralized location within your analytics stack’.
Why is consolidation important? Because it makes data easier to work with. Today, we encourage you to use an analytical database as your central staging location. Of course, you may choose to work with tools connected to multiple databases, each holding a different subset of your data, but we do not wish this pain on even our worst enemies, so we certainly do not wish it on you.
Your central analytics database is usually a data warehouse: a type of database that is optimized for analytical workloads. The process by which such consolidation happens is commonly called ETL (Extract, Transform, Load).
Chapter 2 of the book will go into more detail about this step.
Since we’re talking about the big picture in this chapter, there are only two key components you need to understand.
1. The Data Consolidation Process
This is when your raw source data is loaded into a central database.
If you’re somewhat familiar with the analytics landscape, you might recall that this process is called ETL (Extract, Transform, Load).
In recent years, there has emerged a more modern approach, known as ELT (Extract, Load, Transform).
To discuss the nuances of our approach, we shall first talk about data consolidation in general, before weighing the pros and cons of ETL versus ELT. Yes, you’re probably thinking “Wow! This sounds like a boring, inconsequential discussion!” — but we promise you that it isn’t: down one path lie butterflies and sunshine, and down the other, pestilence and death.
In sum, we will use Chapter 2 to explore:
- How do you set up the data consolidation (Extract-Load) process? What ETL/ELT technology should you choose?
- Why the industry is moving from ETL to ELT: how is ELT different, and why should we care? (See the sketch just below for a first taste.)
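To make the extract-load half of this concrete, here is a deliberately tiny sketch. It is not a recommendation of any particular tool, and every name in it is made up: a hard-coded list stands in for rows extracted from a CRM, and Python’s built-in SQLite stands in for a real data warehouse (think BigQuery, Snowflake or Redshift).

```python
# A minimal Extract-Load sketch (the "EL" in ELT). SQLite stands in for a
# real data warehouse, and a hard-coded list stands in for rows extracted
# from a source system such as a CRM. All names here are made up.
import sqlite3


def extract_from_crm():
    # In practice this would call the source system's API or read its exports.
    return [
        {"id": 1, "email": "alice@example.com", "plan": "pro"},
        {"id": 2, "email": "bob@example.com", "plan": "free"},
    ]


def load_into_warehouse(rows, conn):
    # The rows are loaded as-is: no cleaning, no business logic. Deferring
    # that work to the warehouse is what makes this ELT rather than ETL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_crm_customers (id INTEGER, email TEXT, plan TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_crm_customers (id, email, plan) VALUES (:id, :email, :plan)",
        rows,
    )
    conn.commit()


if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")  # stand-in for the real thing
    load_into_warehouse(extract_from_crm(), warehouse)
```

The detail to notice is that the rows land in the warehouse untouched; all cleaning and modeling is deferred to the warehouse itself, which is exactly the difference between ELT and ETL that Chapter 2 explores.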
2. The Central Analytics Database, or “Data Warehouse”
This is where most of your analytics activity will happen. In this book we’ll talk about:
- Why do you need a data warehouse?
- How do you set one up?
- What data warehouse technologies should you choose?
After working through these two concepts, here is what you will have at the end of this step:
- You will have a data warehouse powerful enough to handle your analytics workload.
- You will have a process in place that syncs all raw data from multiple sources (CRM, app, marketing, etc.) into the central data warehouse.
Once you have these two pieces set up, the next step is to turn that raw data into something meaningful for analytics.
Step 2: Processing Data (Transform & Model Data)
This step is necessary because raw data is not usually ready to be used for reporting. Raw data will often contain extraneous information — for instance, duplicated records, test records, cat pictures, or metadata that is only meaningful to the production system — none of which is useful for business intelligence.
Therefore, you will usually need to apply a “processing step” to such data. You’ll have to clean, transform and shape the data to match the logic necessary for your business’s reporting.
This step usually involves two kinds of operations:
- Modeling data: apply business logic and formulae to the data
- Transforming data: clean up, summarize, and pre-calculate data (see the sketch after this list)
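To make these two operations a little more concrete, here is a minimal sketch that builds on the extract-load example from earlier. Everything in it is hypothetical: the test-record rule, the table names, and SQLite once again standing in for a real warehouse. The transformation drops obvious junk, deduplicates on email, and materializes a small summary table that reports can query directly.

```python
# A minimal transform sketch, building on the earlier extract-load example.
# The SQL runs inside the (stand-in SQLite) warehouse: it drops a
# hypothetical kind of test record, deduplicates on email, and materializes
# a small summary table ready for reporting.
import sqlite3

TRANSFORM_SQL = """
CREATE TABLE IF NOT EXISTS customers_by_plan AS
SELECT
    plan,
    COUNT(DISTINCT email) AS customer_count   -- collapse duplicate records
FROM raw_crm_customers
WHERE email NOT LIKE '%@test.example.com'     -- drop internal test records
GROUP BY plan;
"""

warehouse = sqlite3.connect("warehouse.db")
warehouse.executescript(TRANSFORM_SQL)
warehouse.commit()
```

In a real stack the idea is the same, just with many more tables and many more rules; the important design choice is that the logic runs where the data already lives, inside the warehouse.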
Chapter 3 goes into more detail about these two operations, and compares a modern approach (which we prefer) to a more traditional approach that was developed in the 90s. Beginner readers take note: usually, this is where you’ll find most of the fun — and complexity! — of doing data analytics.
At the end of this step, you’ll have a small mountain of clean data that’s ready for analysis and reporting to end users.
Step 3: Presenting & Using Data
Now that your data is properly transformed for analytics, it’s time to make use of it to help grow your business. This is where you hook up a “reporting/visualization tool” to your data warehouse and begin making those sweet, sweet charts.
Chapter 4 will focus on this aspect of data analytics.
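To give a small taste of what this step looks like at its most basic, here is a sketch of an ad-hoc query against the summary table built in the previous sketch. A BI tool would render the result as a chart; this version simply prints the rows, with SQLite once again standing in for the warehouse.

```python
# An ad-hoc query against the summary table built in the transform sketch.
# A BI tool would turn this result into a chart; here we just print it.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")  # still our stand-in warehouse
rows = warehouse.execute(
    "SELECT plan, customer_count FROM customers_by_plan "
    "ORDER BY customer_count DESC"
)
for plan, count in rows:
    print(f"{plan}: {count} customers")
```

Most of what a BI tool does, from dashboards to self-service exploration, is a more convenient and governed way of issuing queries like this one.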
Most people think of this step as just being about dashboarding and visualization, but it involves quite a bit more than that. In this book we’ll touch on a few ways of using data:
- Ad-hoc reporting, which happens throughout the lifecycle of the company.
- Data reporting: the standing reports and dashboards that most people associate with this step.
- Data exploration: how letting end users freely explore your data lightens the load on the data department.
- The self-service utopia — or why true self-service in business intelligence is so difficult to achieve.
Since this step involves the use of a BI/visualization tool, we will also discuss:
- The different types of BI tools.
- A taxonomy to organize what exists in the market.
Alright! You now have an overview of this entire book. Let’s take a brief moment to discuss our biases, and then dive into data consolidation in Chapter 2.