A simple setup for people just starting out
Before we begin, let’s talk about a minimum viable analytics stack. We will then move on to a full treatment of a data analytics system in the next section.
The Three Things
Every data analytics system does three basic things.
- You have to load data into a central repository of data, usually a data warehouse.
- You have to transform or model your data so that it ends up nice and sparkly and ready for consumption in your data warehouse.
- You have to get that data to business decision-makers, where it may be used.
Querying from your production DB
If you are at a very early stage in your company’s life and you fit the criteria below, it is likely that you can skip some of the steps above:
- If your data comes from only one source (which is most likely your production database) then you can skip the data loading process.
- If you are running a simple, low-traffic website and having an additional analytics workload on your production database will not make a huge dent on your application performance, then you can skip the data warehouse.
- If your raw data is simple enough to be visualized out of the box, or your reporting needs are so simple that you need no complex transformations, then you can skip the data transform and modeling process.
In short, your initial analytics setup can be very simple: just hook business intelligence tool up to the production database of your application.
When you interact with dashboards in your BI tool, the data will be queried live from your application database. As a result, you also happen to be getting the data in real-time.
Querying from a replica DB
The simple setup above, of course, is not all wonderful.
The highest risk you will face with the above setup is performance. Besides your normal production workload, your database now will also take on an additional analytics workload, and may degrade your users’ experience.
The easiest solution here is to set up a replica of your production database.
Check with your dev team to see if they have a replica that you can connect to. There’s a good chance that your dev team has already set something up.
Then, instead of connecting to production, point your BI tool to your replica instead.
Of course, you could do some really bizarre things like export a dump, load that into a local database, and then query that — but most companies we know outgrow such a workflow in a matter of weeks.
When you connect to a replica, your production database will not be burdened by your analytical workload, while the data you receive remains relatively fresh (depending, of course, on how your dev team configures the replication interval).
If your app runs on a NoSQL database
If you run on a NoSQL database (for instance, on something sexy like MongoDB or Cassandra), the above setup will not work for two reasons:
Limited choice of reporting tool.
Since SQL has become the defacto standard for analytics, most BI tools are designed to work with SQL, so that limits the choice of BI tools you may pick.
Also, there is no standardized analytics interface across different NoSQL databases, so you will end up looking for a specialized solution that is designed specifically for your brand of NoSQL database.
Limited analytics capability.
Most NoSQL databases do not have strong support for analytics, both in terms of processing engine and querying interface. Try and write an aggregation query on a big MongoDB collection, and you will quickly understand why we say this. The query will be difficult to write, and the wait for results would be horrific.
You might even get a surprise visit from your dev team for hogging up their production database.
In this situation, the recommended approach is again getting an SQL data warehouse, and then loading data from your NoSQL app into this data warehouse. This moves us towards the direction of a proper analytics setup.
Alright, them’s the basics. Now let’s talk about the vast majority of data analytics stacks we see out there.