Data Modeling Layer & Concepts
A contemporary look at data modeling
In this section we’re going to introduce data modeling from scratch. We shall approach this in a contemporary manner, which means that our presentation here is going to seem rather unusual to you if you’ve had prior experience with more classical techniques.
More specifically: if you have lots of experience with Kimball, Inmon, or Data Vault-style data modeling, skim this section to familiarize yourself with the terms we introduce. We will tie these concepts back to the more classical approaches in the next two sections.
If you don’t have any experience with data modeling, then buckle in. We’re going to take you on a ride. Let’s get started.
What is data modeling and why is it needed?
To best understand what data modeling is, let’s imagine that your company runs a homestay booking site, and that it has data stored in a production database.
When the CEO has a question about the data, she goes to the data analyst and asks: “Hey Daniel, can you help me get the sales commission numbers for bookings in this region?”
Daniel listens to the CEO’s request, goes to his computer, and comes back with the data, sometimes in the form of a short written note, other times with the data presented in an Excel file.
So here’s our question: why can’t the CEO do this herself? Wouldn’t the ideal situation be for the CEO to open up some exploration tool that’s linked directly to the production database and help herself to the data? Why does she have to go through the data analyst?
“She doesn’t know how to do it”, you might say, or “This is serious technical stuff”. These are common responses that you might get when you pose this question.
But there’s something else that’s going on here.
The CEO (or any other business user, for that matter) thinks in business terms, using business logic. Your actual data, on the other hand, is stored in a different format. It follows different rules — often rules imposed by the implementation of the application.
The CEO can’t translate her mental model of the business into code in order to run the numbers. She doesn’t know how the data is organized. But Daniel does.
For example: when asking about sales commissions, the CEO will think “sales commission is 5% of closed deals”. The data analyst, however, will think “closed deals are stored in the `closed_deals` table, so I need to take the `amount` column and multiply it by `months` to figure out the final amount; oh, and I need to check the `payment_received` column to make sure that only received payments are counted”.
Here, Daniel the data analyst simply serves as a “data translator”:
- He receives a data question in business English
- He figures out where the corresponding data lies in his data warehouse
- He then translates the business question into corresponding data logic, and expresses this logic in the form of a data query (usually SQL).
- He runs the query, pulls the results into Excel, formats them, and then sends them over to the CEO.
Essentially, the data analyst knows the mapping from business logic to data logic. That’s why he is able to help the CEO with her data questions.
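To make this mapping concrete, here is a sketch of the query Daniel might write for the commission question above. The table and column names come from the example; the exact schema and the region filter are assumptions for illustration:

```sql
-- "Sales commission is 5% of closed deals", translated into data logic.
-- Table and column names follow the example above; the region column
-- and its value are illustrative assumptions.
SELECT
  SUM(amount * months) * 0.05 AS sales_commission
FROM closed_deals
WHERE payment_received = TRUE
  AND region = 'North America';
```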
This process works fine for some companies. But it will not scale up beyond a few people, and is an incredibly inefficient way to do things. Why? Well:
- Your data analyst Daniel is now a bottleneck. Every small change needs to go through him.
- What if Daniel goes on leave? What if Daniel leaves the company? What if Daniel forgets how a certain piece of business logic is implemented?
- Every time the CEO wants something, she needs to wait hours (or even days) for Daniel to crunch the numbers and get back to her.
- At some point, Daniel might be too busy crunching numbers for different business stakeholders to focus his time on more valuable, long-term work.
So what do we need to do here?
We need to offload the mapping knowledge inside Daniel’s head into some system, so that anyone can understand it. We need to externalize it, so that it doesn’t just live in Daniel’s head.
The Data Modeling Layer
Earlier in this book, we introduced you to the idea that we should extract data into a data warehouse first, before doing transformations. We mentioned that this is commonly known as the ‘ELT’ paradigm.
What we want to do now is to perform a series of transformations that offload the mapping in Daniel’s head into something that is persisted in a data warehouse. Again, all of these transformations are to be performed within the data warehouse, as per the ELT paradigm.
This process of mapping raw data to a format that can be easily understood by business users is known as ‘data modeling’. There are other reasons to do data modeling, of course. Performance is one of them, as is explorability. But at its most basic level, data modeling is about taking raw data and transforming it into a form that is useful for business measurement.
The contemporary approach to data modeling is to orchestrate transformations within the data warehouse, via a tool that sits on top of it. This stands in contrast to ETL tools of the past, which usually existed as pipelines external to the data warehouse.
Such tools include Holistics, dbt, Dataform, and Looker. They share a couple of similar characteristics:
- They connect to your data warehouse.
- They treat the modeling process as the act of transforming data from old tables to new ones within the data warehouse.
- They generate SQL behind the scenes to execute such transformations.
- They allow users to annotate, manage, and track changes to data models over time.
- They allow users to trace the lineage of data transformations within a single tool.
There isn’t a good name for such tools right now. For the sake of convenience, we will call them ‘data modeling layer’ tools. Conceptually, they present a ‘data modeling layer’ to the analytics department.
A data modeling layer is a system that contains the mapping between business logic and underlying data storage rules of your business. It exists primarily in the ELT paradigm, where data is loaded into the data warehouse first before being transformed.
In this context, data modeling is the process of building and maintaining this layer.
Usually, the data modeling layer is later connected to a visualization tool or business intelligence layer. Non-technical users should be able to log in, interact with a user interface, and get the analytics they need, without needing to talk to anyone technical.
With a proper, well-maintained data modeling layer, everyone is happy:
- The CEO can just log in to the BI application, ask questions and get the right numbers that she needs, without waiting for the data analyst. In other words, business users can now do self-service analytics.
- The data analyst’s job is now focused on maintaining the data pipeline and modeling layer, without being bombarded by ad-hoc data requests.
- The entire company has a well-documented layer of data knowledge. Even if the data analyst is busy or leaves the company, this knowledge is properly annotated and organized, and not at risk of being lost.
Now that we know what data modeling is at a high level and why it’s important, let’s talk about specific concepts that exist in the data modeling layer paradigm.
Data Modeling Layer Concepts
Let’s return to the homestay booking example above. We shall use this as an overarching example to introduce data modeling layer concepts to you.
These are the basic database tables that you pulled into your data warehouse.
Now, let’s look at a few modeling operations that you can apply to the above data.
We’ll be using Holistics for the examples below. That said, these concepts map pretty well across any data modeling layer-type tool in the market. (We’ll tell you if we’re introducing an idea that is specific to Holistics, with no clear external analogs).
Concept: Data Model
When manipulating data in a data modeling layer, it’s common not to deal with the underlying data table directly, but to instead create an abstract object above it for ease of manipulation.
This is called a data model.
A data model is an abstract view on top of a physical database table that you may manipulate without directly affecting the underlying data. Most data modeling layers allow you to store additional metadata that may enrich the underlying data in the data table.
The most common types of metadata at the level of a data model are textual descriptions, calculated dimensions that capture some business logic, and relationship mappings between a model and some other model (or table). Data modeling layers often also include housekeeping metadata alongside the data model, such as a full history of user accounts who have modified the model, when that model was first created and last modified, when the underlying data was last refreshed, and so on.
While database tables hold data, data models often contain metadata to provide extra context for that data.
A data table is managed by the database, and a data model is managed by the data modeling layer that you use.
Here’s a quick comparison between the two:
Data Table:
- Is ‘physical’, lives in the analytical database
- Stores actual data records
Data Model:
- Is an abstract object, managed by the data modeling layer
- Stores metadata (description, business logic, relationship mapping)
- Usually sits above a data table
- Is usually created via a `SELECT` statement from the data table (see the sketch below)
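To make the last point concrete, a data model behaves roughly like a named `SELECT` over a physical table. Here is a minimal sketch using a database view as an analogy (the column names are assumptions; real modeling layers store the definition plus its metadata in their own systems, not as database views):

```sql
-- A data model is conceptually close to a view: a SELECT statement
-- layered over a physical table. (Analogy only; the modeling layer,
-- not the database, manages the real object and its metadata.)
CREATE VIEW homestay_listings AS
SELECT
  id,
  host_id,
  name,
  city,
  country
FROM homestay.listings;
```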
By connecting a modeling layer to your analytics database, you can “model up” your database tables, adding textual descriptions and other metadata. This is what it looks like in Holistics:
In the Holistics interface above, we take a table in our data warehouse named `homestay.listings` and create a data model named `homestay_listings`. This model is ‘abstract’, in the sense that it lives only within the data modeling layer.
Similar flows exist in dbt, Dataform, and Looker.
Concept: Relationship mapping
A relationship mapping is a relationship defined between two data models.
This relationship is analogous to the foreign key concept in relational databases. In this context, however, we are creating relationships between two data models instead of tables — that is, relationships between two abstract ‘views’ of data that don’t actually exist in the data warehouse below.
Defining a model’s relationships is like defining JOINs between the data tables.
This is useful because you may want to create models that derive from other models, instead of deriving from just underlying database tables.
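For example, suppose the `homestay_bookings` model has a relationship mapping to `homestay_listings` (the key column names below are assumptions). The mapping tells the modeling layer how to generate a join whenever fields from both models are used together:

```sql
-- The JOIN implied by a relationship mapping between homestay_bookings
-- and homestay_listings. Key column names are illustrative assumptions.
SELECT
  b.*,
  l.city,
  l.country
FROM homestay_bookings b
LEFT JOIN homestay_listings l
  ON b.listing_id = l.id;
```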
Concept: Custom Field Logic
Remember that the CEO had a few pieces of business logic she might want to check? For instance, one metric she might be particularly interested in is the number of guests per booking. This way, she can get a good idea of the inventory match between available rooms and groups of guests. Notice that a measure of the number of guests might not exist in the underlying table. Instead, we’re going to define it as a combination of table dimensions, within the data modeling layer.
In Holistics, we open up the `homestay_bookings` model and define our custom dimensions and measures, like so:
We have defined two custom fields here. The first is the number of nights stayed for a successful booking (that is, a booking that has reached the ‘checked out’ status): if the booking has checked out, the nights are counted; otherwise, the field returns 0.
The second is a calculated field that returns the total number of guests per booking. This latter field is a simple sum, because the `homestay.bookings` table stores the number of children, babies, and adults as separate numbers. In our `homestay_bookings` model, we simply define the total number of guests as the sum of all three numbers.
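In SQL terms, the two custom fields might look something like this (the `status` value and column names are assumptions based on the description above):

```sql
SELECT
  id,
  -- custom field 1: nights stayed, counted only for successful
  -- (checked out) bookings
  CASE WHEN status = 'checked out' THEN nights ELSE 0 END AS nights_stayed,
  -- custom field 2: total guests, summed from the three separate
  -- guest-count columns in the underlying table
  adults + children + babies AS total_guests
FROM homestay_bookings;
```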
We call these measures ‘custom fields’ in Holistics, though the idea is more important than the specific name we’ve used. The idea is this: data modeling layers allow you to create calculated fields that are combinations of other fields within the model. Similar concepts exist in other modeling and BI tools — the point is that you want to be able to augment the existing dimensions in the underlying tables.
Concept: Models Built On Top of Other Models
At this stage, we’ve merely demonstrated that data models allow you to annotate and enrich existing tables in your data warehouse. But what if you want to transform your data into a new set of tables? Imagine, for instance, that you want to take the data within `homestay.listings` and turn it into a star schema. In the past, you might have done this by asking a data engineer to set up a new ETL pipeline. With a data modeling layer tool, however, you may do this within the same user interface.
Below, we’ve created a new model that is derived from the `homestay_listings` data model (not the table!). This model is called `bookings_revenue`, and it combines fields from two different models — the `homestay_listings` model and the `homestay_bookings` model — to calculate things like `gmv` and `host_revenue`.
Notice the conceptual leap that’s just happened. `homestay.bookings` and `homestay.listings` are two tables in our data warehouse. We’ve created two models above them: `homestay_bookings` and `homestay_listings` respectively. As we’ve mentioned previously, these models allow us to annotate or create new derived fields without touching the tables themselves.
We have then taken the two models, and created a new transformed model on top of them.
This is powerful for two reasons. First, all of this has happened via SQL. We do not need to wait for data engineering to set up a new transformation pipeline; we can simply ask an analyst to model the data within the Holistics data modeling layer, by writing SQL along the following lines:
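(A sketch of such a model query; the exact fields, such as `total_price` and the 15% platform fee, are assumptions for illustration.)

```sql
-- bookings_revenue: a transformed model built by joining two other models.
-- Column names and the 15% platform fee are illustrative assumptions.
SELECT
  b.listing_id,
  SUM(b.total_price)        AS gmv,          -- gross merchandise value
  SUM(b.total_price * 0.85) AS host_revenue  -- what the host keeps
FROM homestay_bookings b
JOIN homestay_listings l
  ON b.listing_id = l.id
WHERE b.status = 'checked out'
GROUP BY b.listing_id;
```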
As you can see, our new `bookings_revenue` model is simply a `SELECT` statement that joins two models.
Second, the fact that our model exists as a simple join means that our `bookings_revenue` model will be updated whenever the two underlying models (or their underlying tables) are updated!
Again, the idea here is more important than the particular implementation we’ve shown you. You will find similar flows in other data modeling layer tools (like Dataform, for instance).
Concept: Model Persistence (equivalent to materialized view)
Once you understand that it’s possible to create new models from other models by writing SQL, the next step is to ask: is it possible to persist such transformations?
The answer is yes, of course!
While data models are normally ‘views’ (or `SELECT` statements) on top of physical tables, nearly all data modeling layers allow you to persist these as new tables within the data warehouse. This is what the Holistics persistence menu looks like, for instance:
Here, Holistics offers to update these models at periodic intervals. In our example above, we can choose to persist `bookings_revenue` to our data warehouse, and let Holistics update the model every day at 7 am (or at whatever other interval we wish).
Persistence is useful for performance reasons; in this case, our revenue reporting can run on top of the persisted `bookings_revenue` table, instead of the abstract model itself.
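Under the hood, persisting a model amounts to materializing its `SELECT` statement into a real table on a schedule, roughly like the sketch below (the `persisted` schema name and column details are assumptions):

```sql
-- What a scheduled persistence run does, conceptually: drop and
-- rebuild a physical table from the model's SELECT statement.
DROP TABLE IF EXISTS persisted.bookings_revenue;

CREATE TABLE persisted.bookings_revenue AS
SELECT
  b.listing_id,
  SUM(b.total_price)        AS gmv,
  SUM(b.total_price * 0.85) AS host_revenue
FROM homestay_bookings b
JOIN homestay_listings l
  ON b.listing_id = l.id
GROUP BY b.listing_id;
```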
Putting things together
Let’s put the above concepts together in a single diagram, as a way to wrap things up.
Notice three things:
- `homestay_listings` is a base model that sits on top of the `homestay.listings` table, with relationship mappings to `homestay_bookings` and `homestay_hosts`.
- `bookings_revenue` is a ‘transformed’ model that is drawn from `homestay_bookings` and `homestay_listings`. Note how it contains custom fields that are calculations over multiple fields that exist within its underlying models.
- Similarly, `bookings_profit` is a transformed model that is taken from `bookings_revenue`.
This dependency diagram of models is pretty representative of the sort of work analysts at Holistics do on a day-to-day basis. The job of data analytics is essentially a process of modeling raw tables into base models, and then modeling base models into transformed models.
Our analysts do this until they have a tapestry of data models that represent everything our business people would ever want to know about our company. Then, creating a report simply becomes an issue of picking the right models to present to the business.
What do business users get in the end?
Remember at the beginning of this section, we said that with a data modeling layer, the CEO can extract meaningful numbers by herself without having to bother the data analyst? Well, we weren’t lying.
With a SQL-based data modeling BI tool (like Holistics or Looker), the CEO can now use a UI to help herself to the data she needs, based on the metrics, descriptions, and relationships we have defined above.
An example of the self-service data exploration UI that the CEO can play around with.
What happens behind the scenes is this: the CEO’s selection will be translated into a corresponding SQL query (thanks to the modeling layer), and this query will be sent to the analytical database. The retrieved results will then be displayed for the CEO to view.
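For instance, if the CEO asks for total guests by city, the modeling layer can assemble the query from the models and relationships we defined earlier. This is a sketch; actual generated SQL varies by tool, and the column names are the assumptions from above:

```sql
-- Generated behind the scenes from the CEO's point-and-click selection:
-- the total_guests custom field, grouped by city via the relationship
-- between bookings and listings.
SELECT
  l.city,
  SUM(b.adults + b.children + b.babies) AS total_guests
FROM homestay_bookings b
JOIN homestay_listings l
  ON b.listing_id = l.id
GROUP BY l.city;
```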
Summary
So what have we covered? We’ve looked at data modeling in the ELT paradigm. The modern approach to data modeling in this paradigm is to use what we call a ‘data modeling layer’, though this is a name that we’ve adopted out of convenience. Such tools include Dataform, dbt, Looker, and Holistics itself.
We then discussed several ideas that exist within this approach to data modeling:
- We talked about how data models are ‘abstract views’ on top of your data, within the context of a ‘data modeling layer’.
- We talked about how such data modeling layer-type tools allow users to enrich models with metadata, and how most modeling layers add useful housekeeping data like user modifications, freshness of the underlying data, and so on.
- We discussed two useful features of data models: custom field logic and relationship mapping.
- We then talked about how data models may be built on top of other models.
- Finally, we talked about how such derived models may be persisted, just like materialized views in a more traditional relational database system.
In our next section, we shall talk about how this approach to data modeling works when it is combined with the classical method of dimensional data modeling.