OLAP != OLAP Cube
One of the most consistent bits of confusion I’ve seen in reaction to last week’s The Rise and Fall of the OLAP Cube is the perception that OLAP and OLAP cubes are the same thing.
They aren’t.
OLAP (short for ‘online analytical processing’) is a pattern of database usage — something I dedicated an entire section to in my original piece. Go read that section if you’re still confused; in a single sentence, OLAP is the type of work you must do if you are performing typical data analysis (e.g. you filter a dataset based on some set of dimensions, aggregate values in the collection, or present data across those dimensions).
The OLAP cube, on the other hand, is an efficient data structure used to perform OLAP workloads. I sometimes joke that OLAP cubes are just ‘very sophisticated nested arrays’ … which has a grain of truth in it, but only if you squint real hard.
In reality, OLAP cubes as they are implemented today are the result of over 40 years worth of research and development. Cubes are implemented with dozens of optimizations: these include advanced data compression techniques, various embedded data structures to help with indexing and filtering and scanning, and heuristics to figure out what to load into memory, what to pre-aggregate, and what to persist to disk.
In short: you may run OLAP workloads on an OLAP cube or in a columnar database; OLAP != OLAP cube.
Why Understanding These Definitions Matter
At this point you might think “this seems like a definitional thing, why does it matter what words mean?” But I think there are two reasons it matters.
The first reason is that accurate definitions are important if you want to track the latest developments in the field. The art of business intelligence advances when new ideas are published in research papers and when new ideas are implemented in commercial products. More often than not, the marketing materials of these products use terminology that's inherited from academia.
When a research paper says ‘OLAP’, for instance, they mean ‘a type of workload’, and when they say ‘OLAP cube’ they mean ‘a specific data structure’. It is thus important to understand the original meaning of those terms — at least, it is if you want to keep up with the evolving landscape. It is also important to understand these definitions if you want to evaluate competing offerings in both columnar and cube camps.
The second reason it matters is because it is helpful when you communicate with other people in the data world. Because of widespread confusion, different people mean different things when they say ‘OLAP’ and ‘OLAP cube’. If you understand what alternative definitions exist, you’ll be able to code-switch and use their definitions instead. I believe using ‘wrong’ definitions may sometimes be useful. After all, it can make for the difference between a good discussion and a bad one: you either speak their language, or you talk past each other.
Better to understand what they mean, so you can do more of the former and less of the latter.
How People Conflate The Two Concepts
There are two ways that people conflate OLAP with OLAP cubes.
The first way is to believe that OLAP workloads can only run on top of cubes. Take this blog post by OLAP cube provider Kyligence, for instance, which says:
OLAP (short for OnLine Analytical Processing) is an approach designed to quickly answer analytics queries involving multiple dimensions. It does this by rolling up large, sometimes separate, datasets into a multidimensional database known as an OLAP Cube.
Wikipedia also puts the two together in its article about OLAP:
At the core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures that are categorized by dimensions. The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space.
As I’ve written earlier, this is no longer the case: columnar data warehouses are now able to run large OLAP workloads, and they call for a slightly different approach (you materialize views and/or persist them within the warehouse, instead of extracting such data into cubes).
To be fair to these two pieces, it was true that OLAP meant using an OLAP cube, at least for much of the history of business intelligence. I don’t mean to say that you can’t do OLAP work with a traditional relational database (this is known as ROLAP, and is doable if a little annoying); the fact remains that for the past 30 years, a cube was your best option if you wanted to do serious analysis on a large dataset.
The second way that people conflate the two ideas is to say that the OLAP cube is an abstract concept, and a columnar datastore is merely a way to implement this concept. I refer to this comment on Hacker News:
It seems that the article makes a categorical error, arguing that OLAP cubes were replaced by columnar data stores. I always understood OLAP cube as an abstract concept that can have various technical implementations, while column store is a kind of optimization in that technical implementation.
As well as this comment:
I have a slight problem with the terminology in the middle of the article, as I'm so far down the rabbit-hole that I think of cubes as databases; I suffer cognitive dissonance when I read about shifts from cubes to databases etc. To me, a cube is just a fancy term for a table/view for a particular use-case.
This confusion is understandable, because cube interfaces were widespread in the old paradigm, and OLAP cubes were pretty much the only option for serious OLAP work not too long ago. To make matters worse, several relational databases implemented a cube
operator in the 90s, which materialized cube-like views for OLAP workloads, because, well, why not? (Gray, Bosworth, Lyaman & Pirahesh, 1996)
There are two pieces of evidence that suggest this view is mistaken, however. The first piece is from the mainstream columnar database providers themselves. You’ll note that BigQuery, Redshift and Snowflake will all say that they are well suited for OLAP workloads, but will never say they are OLAP cubes. OLAP cubes may be materialized from columnar databases, but columnar databases are not themselves cubes. They are simply two different things.
The second is to look at the research done on cubes from the beginning of time. My first instinct here was to look for summary papers to see what they said of OLAP and OLAP cubes; I found Alfredo Cuzzocrea’s paper on 10 years of cube compression techniques rather quickly (Cuzzocrea, 2010). Next, I searched through Cuzzocrea’s publication history for an implementation paper, and found one where the authors describe a mobile-oriented cube system (Cuzzocrea, Furfaro & Sacca, 2008).
In both papers, ‘OLAP’ meant ‘workload type’, and ‘OLAP cube’ (or ‘data cube’) meant a specific data structure. I then skimmed their citations for prior work, and was satisfied that this usage was consistent, stretching back to the early 90s.
Conclusion
I’ve spent about a thousand words at this point explaining the differences between these two ideas, so I think I should wrap up.
In sum:
- OLAP is a type of database workload.
- OLAP cubes are a specific data structure designed for OLAP workloads
- Some people think that OLAP can only be implemented on top of OLAP cubes. This is mistaken.
I’ll admit that I’ve written this piece for one other reason. When I first started in the world of Business Intelligence, I was terribly confused by the marketing materials and random blog posts I would find on OLAP and OLAP cubes. It was only in the process of writing my previous post that I finally sat down to dig into the history — and research! — of both concepts. This post presents a summary of what I’ve found.
The confusion between these two ideas are both widespread and real: to this very day, we get emails from prospects asking us if they can create OLAP cubes with Holistics. The answer is nuanced and rather difficult to articulate — it really depends on what they mean by OLAP and what they mean by cube.
My hope now is that we can put these doubts to rest. I also hope that people will stop writing ‘the death of OLAP’ articles; it really doesn’t make sense given that OLAP is a form of database use. But I don’t know — words are weird, and definitions are whatever people say they are. Perhaps it’s too much to ask.