OLAP databases - A Primer

OLAP vs OLTP and the evolution of OLAP

We are all aware of traditional relational databases such as MySQL, Oracle and Postgres. These are what we interact with daily through our CRUD (Create, Read, Update and Delete) apps. While they are immensely useful and necessary technology, a different class of databases has emerged for generating insights on the transactions that occur in OLTP/traditional relational databases. That is where OLAP databases come in: they provide historical context on normal business operations.

An OLTP database can act as an analytical source of truth as-is, but at today's data scale it becomes a complex system holding huge amounts of data. Analytical queries can take hours to run, and scalability and high availability become difficult if a single-node architecture is used.

OLAP has gone through various iterations over the decades to solve the problems of complex queries, high availability and query performance.

First, we'll look into the shared-* architectures (Shared-Everything, Shared-Disk and Shared-Nothing).

Shared Architectures

Sharing means three things: Compute, Memory and Storage

Shared-Everything Architecture

This is a single-node architecture where everything is shared. An example would be a UNIX system where everything resides on the same machine. This is great for co-location but not good for fault tolerance and performance. A single server means a single point of failure, hence no fault tolerance. Keeping all data in one machine without partitioning means performance degrades as data grows.

Shared-Disk and Shared-Nothing provide improvements on each of these criteria.

Shared-Nothing Architecture

Each database node has its own:

  1. CPU

  2. Memory

  3. Storage

These nodes talk to each other over a shared network. Data is split/partitioned into subsets across the nodes.

Data is fetched via a POSIX API (such as fread()), since the storage is local to each node.

Shared-Nothing is harder to scale in data capacity because of its tightly coupled storage layer, but it can deliver better performance because data and compute are co-located on each instance.
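To make the partitioning idea concrete, here is a minimal Python sketch (my own illustration, not any particular database's code) of hash-partitioning rows across nodes and fanning a query out to each node's local data:

```python
# Minimal sketch of shared-nothing partitioning (illustrative only).
# Each "node" owns its own partition of the data; a query is fanned out
# to every node and the partial results are merged by a coordinator.

NUM_NODES = 3

rows = [
    {"id": 10, "name": "Ram",   "dept": "Devops",       "role": "SDE1"},
    {"id": 20, "name": "Sam",   "dept": "Security",     "role": "SDE2"},
    {"id": 30, "name": "Aditi", "dept": "Software Dev", "role": "SDE3"},
]

# Partition rows by hashing the key; each node stores only its own subset.
partitions = {node: [] for node in range(NUM_NODES)}
for row in rows:
    partitions[hash(row["id"]) % NUM_NODES].append(row)

def scan_node(node, predicate):
    """Runs locally on a node: compute is co-located with that node's data."""
    return [r for r in partitions[node] if predicate(r)]

# Coordinator: fan the query out to all nodes, then merge the partial results.
result = []
for node in range(NUM_NODES):
    result.extend(scan_node(node, lambda r: r["role"] == "SDE2"))

print(result)  # rows matching the predicate, gathered from every node
```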

Shared-Disk Architecture

Each database node has its own compute layer:

  1. CPU

  2. Memory

The storage layer is decoupled and shared; nodes talk to the storage and to each other via the network layer(s).

Data is fetched via a userspace API (for example an object-store call) instead of a POSIX API such as fread().

Shared-Disk makes it easier to scale data capacity thanks to its loosely coupled storage layer, but it can lead to higher compute utilization because the data has to be moved from the storage layer to the compute layer before filtering can happen there.
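And a matching sketch of the shared-disk case (again purely illustrative): storage is a dumb shared service, so every byte has to travel to a compute node before it can be filtered:

```python
# Minimal sketch of shared-disk access (illustrative only).
# Storage is a shared service with no query logic; every compute node can
# read the same data, but filtering happens only after the bytes have been
# moved from the storage layer to the compute layer.

SHARED_DISK = {  # stands in for a shared storage service / object store
    "employees": [
        {"id": 10, "name": "Ram",   "role": "SDE1"},
        {"id": 20, "name": "Sam",   "role": "SDE2"},
        {"id": 30, "name": "Aditi", "role": "SDE3"},
    ]
}

def read_table(table):
    """Storage-layer API: returns raw data, no filtering capability here."""
    return list(SHARED_DISK[table])

def compute_node_query(role):
    """Compute layer: pulls the whole table over the 'network', then filters."""
    rows = read_table("employees")          # all bytes cross the network
    return [r for r in rows if r["role"] == role]

# Any number of stateless compute nodes can run this against the same storage.
print(compute_node_query("SDE2"))
```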

Row Oriented vs Column Oriented Stores

Paper: Column-Stores vs. Row-Stores: How Different Are They Really? (see References)

One of the key differences between OLAP and OLTP databases is how the data is stored.

In OLTP, since data is mostly queried by IDs, a query usually touches a single row/tuple.

In OLAP, querying is mostly based on columns, with several GROUP BY statements.

Hence, traditional databases such as MySQL, Oracle and Postgres use row-oriented storage, whilst OLAP databases such as MonetDB and Vertica use column-oriented stores.

Row-Oriented Store

Here data is stored as rows at the storage layer.

For example

|ID| Name | Department |Role|
|10|Ram   | Devops     |SDE1| 
|20|Sam   | Security   |SDE2| 
|30|Aditi |Software Dev|SDE3|

In a traditional DB, a typical query would look up ID #20 to get information on Sam. There could be some JOINs, but in most cases the query is about a particular row.

A row-oriented store can be adapted to behave like a column-oriented one (as early OLAP systems did by forking Postgres), but such systems carry the baggage of row-oriented metadata and do not bring major performance gains for analytical workloads.

Some approaches that were used to move a row-oriented database towards a column-oriented one (a small sketch of vertical partitioning follows the table):

Improving Row-oriented datastore to become Column-oriented

|What|How|Disadvantage|
|Vertical Partitioning|Connect columns by adding a position attribute to every column|Row stores keep heavy metadata for every row; this increases space and is not needed for column scans|
|Index-only scans|B+Tree index for every column|An index scan for every read can be slower than the traditional heap-file scan used with vertical partitioning|
|Materialized Views|Create temporary data for faster access|A changing business need might make the materialized view largely irrelevant|
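As a rough illustration of the first row of the table (a sketch of my own, not code from the paper), vertical partitioning stores each column separately and stitches a row back together through a position attribute:

```python
# Illustrative sketch of vertical partitioning in a row store:
# each column becomes its own (position, value) table, and a row is
# reconstructed by matching positions across the column tables.

table = [
    (10, "Ram",   "Devops",       "SDE1"),
    (20, "Sam",   "Security",     "SDE2"),
    (30, "Aditi", "Software Dev", "SDE3"),
]

# One (position, value) list per column.
ids   = [(pos, row[0]) for pos, row in enumerate(table)]
names = [(pos, row[1]) for pos, row in enumerate(table)]
roles = [(pos, row[3]) for pos, row in enumerate(table)]

def fetch_row(pos):
    """Reconstruct a full row by joining the column tables on position."""
    return (ids[pos][1], names[pos][1], roles[pos][1])

print(fetch_row(1))  # (20, 'Sam', 'SDE2') -> the reconstructed row for Sam
```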

Column-Oriented Store

The same data is stored as columns at the storage layer.

For example

|ID| Name | Department |Role|
|10|Ram   | Devops     |SDE1| 
|20|Sam   | Security   |SDE2| 
|30|Aditi |Software Dev|SDE3|

In an OLAP DB, queries are mostly about columns, e.g. "show me the list of users whose salary is >$1000 and who belong to the SDE2 role". Here it makes sense to store data in columns so that filters and GROUP BY clauses are more efficient.
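A tiny sketch contrasting the two physical layouts in memory; the column layout lets an aggregate or filter touch only the arrays it actually needs:

```python
# Same table, two physical layouts (illustrative only).

# Row-oriented: the values of one row sit together.
row_store = [
    {"id": 10, "name": "Ram",   "dept": "Devops",       "role": "SDE1"},
    {"id": 20, "name": "Sam",   "dept": "Security",     "role": "SDE2"},
    {"id": 30, "name": "Aditi", "dept": "Software Dev", "role": "SDE3"},
]

# Column-oriented: the values of one column sit together.
column_store = {
    "id":   [10, 20, 30],
    "name": ["Ram", "Sam", "Aditi"],
    "dept": ["Devops", "Security", "Software Dev"],
    "role": ["SDE1", "SDE2", "SDE3"],
}

# OLTP-style lookup: the row layout hands back everything about one entity.
print(next(r for r in row_store if r["id"] == 20))

# OLAP-style aggregate: the column layout only needs to scan one array.
print(sum(1 for role in column_store["role"] if role == "SDE2"))
```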

Efficiency in Column-oriented

  1. Compression

It is easier to compress data in a column-oriented store because similar data (one column's values) is stored together. For example, if a timestamp column is present, a delta-encoding scheme can be used where the first value is stored as a full Unix timestamp while the rest are stored as increments. This lowers the space cost, as the increments can be stored in fewer bits.
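A minimal sketch of that idea, assuming the timestamps are already sorted:

```python
# Illustrative delta encoding of a timestamp column:
# store the first value once, then only the small increments.

timestamps = [1700000000, 1700000005, 1700000009, 1700000020]

base = timestamps[0]
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
encoded = (base, deltas)          # e.g. (1700000000, [5, 4, 11])

# Decoding walks the increments back up to the original values.
decoded = [base]
for d in deltas:
    decoded.append(decoded[-1] + d)

assert decoded == timestamps
print(encoded)
```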

  2. Late Materialization

Since each column is stored separately, full tuples are only formed ("materialized") late in the query plan, after most of the filtering and combining of columns has already happened.

For aggregates, a column store makes more sense: a row store ends up reading a lot of unnecessary columns that are simply discarded later in the query execution cycle.
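A rough sketch of late materialization over a small column store (my own illustration): the predicate runs on the columns it touches first, and other columns are fetched only at the surviving positions:

```python
# Illustrative late materialization: evaluate the filter on the columns it
# touches, then fetch only the matching positions from the other columns.

column_store = {
    "name":   ["Ram", "Sam", "Aditi"],
    "role":   ["SDE1", "SDE2", "SDE2"],
    "salary": [900, 1200, 1500],
}

# Step 1: evaluate the predicate using only the role and salary columns.
matching_positions = [
    i
    for i, (role, salary) in enumerate(zip(column_store["role"], column_store["salary"]))
    if role == "SDE2" and salary > 1000
]

# Step 2: only now pull the name column, and only at the matching positions.
result = [(column_store["name"][i], column_store["salary"][i]) for i in matching_positions]
print(result)  # [('Sam', 1200), ('Aditi', 1500)]
```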

Most modern big-data file formats store data in a columnar layout.
Examples for column-oriented file formats:

  1. Apache Arrow: Store data in columnar data memory format

  2. Apache Parquet: Column-oriented data format for efficient data storage and retrieval

  3. Apache ORC: Columnar storage for Hadoop workloads

And examples for column-oriented data stores:

  1. ClickHouse: Column-oriented OLAP database

  2. DuckDB: Columnar-vectorized query execution engine
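As a quick, hedged illustration of two of these pieces working together, the snippet below writes a tiny Parquet file with pyarrow and queries it with DuckDB in-process. It assumes the duckdb and pyarrow packages are installed; the file name and data are made up for the example.

```python
# Illustrative: a column-oriented file format (Parquet) queried by a
# column-oriented engine (DuckDB). Assumes `pip install duckdb pyarrow`.
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny column-oriented Parquet file (file name is just for the demo).
table = pa.table({
    "name":   ["Ram", "Sam", "Aditi"],
    "role":   ["SDE1", "SDE2", "SDE2"],
    "salary": [900, 1200, 1500],
})
pq.write_table(table, "employees.parquet")

# DuckDB scans only the columns the query needs.
result = duckdb.sql(
    "SELECT role, count(*) AS cnt, avg(salary) AS avg_salary "
    "FROM 'employees.parquet' GROUP BY role"
)
print(result)
```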

OLAP over the decades

1990s - First-generation OLAP - Data Cubes

Here data from disparate sources is fetched and combined into a data cube.
The cube is pre-aggregated and can then be queried for analytical purposes.

But this approach was created in the 1990s, when data was simpler and less granular than today. Now we have all sorts of data with varying types in our databases; the simple employee table with a handful of columns no longer exists. This led to poor scalability and flexibility.
Also, since the cube combined data from many different sources, it led to higher data duplication, which in turn led to higher storage costs.

Examples:

|Tool|Note|
|MSFT SQL Server|Data Cube|
|Oracle|Data Cube|
|Teradata|OLAP from day 1|
|Sybase|Data Cube|
|IBM DB2|Data Cube|
|Informatica| |

2000s - Second Generation Architectures

In the 2000s a new paradigm appeared: ETL (Extract, Transform, Load), sketched in code after the list below.

  1. Extract: Pull raw data from multiple sources. This could be databases, CSVs, JSON etc

  2. Transform: Transform the data before dumping it into the OLAP database. Transformation could mean cleaning the data, creating a proper data model as per business needs, etc.

  3. Load: Load into a database which can be consumed by business via tools such as PowerBI, Tableau, DOMO etc.
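A minimal, made-up sketch of those three steps in Python, with SQLite standing in for the warehouse; the file name and columns are hypothetical:

```python
# Illustrative ETL: extract raw rows from a CSV, transform in Python,
# load into SQLite, which stands in for the warehouse (names are made up).
import csv
import sqlite3

# (Set-up for the demo: create a small raw source file.)
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["order_id", "customer", "amount"],
                      ["1", " ram ", "100"], ["2", "sam", ""], ["3", "aditi", "250"]])

# Extract: pull raw data from the source.
with open("orders.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: clean and reshape *before* loading (the "T" happens before the "L").
transformed = [
    (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
    for row in raw_rows
    if row["amount"]  # drop rows with a missing amount
]

# Load: write only the cleaned rows into the warehouse table.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)
conn.commit()
print(conn.execute("SELECT * FROM orders").fetchall())
```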

A lot of data warehouses came into the picture, such as MonetDB and Vertica.
A lesser-known fact is that most of these data warehouses were forks of traditional databases, modified to improve column-oriented analytical queries. This was discussed in the "Improving Row-oriented datastore to become Column-oriented" section.

Examples:

|DB|Note|
|Netezza|Fork of Postgres|
|ParAccel|Fork of Postgres|
|MonetDB|DuckDB v1 was based on MonetDB; v2 is not|
|Greenplum|Fork of Postgres|
|DataAllegro|Acquired by Microsoft|
|Vertica|Fork of Postgres|

2010s and now: Rise of Decoupled OLAPs

  1. Send Data to a decoupled Object Storage

  2. Send Metadata information to a catalog/metadata store

  3. The query engine queries the catalog/metadata store to get the location and other information on how to query the data.

  4. The query engine queries the object storage and pulls the data.

  5. The query engine processes the data based on the filtering criteria and sends the result to the client (a toy sketch of this flow follows the list).
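To tie the steps together, here is a toy sketch (none of these names correspond to a real system's API) where a catalog holds the metadata, an object store holds the data, and the query engine coordinates both:

```python
# Toy sketch of a decoupled OLAP query flow (all names are made up).

CATALOG = {  # metadata store: table name -> where the data lives
    "sales": {"location": "s3://bucket/sales/part-0", "columns": ["region", "amount"]},
}

OBJECT_STORE = {  # stands in for S3/GCS; holds the raw column data
    "s3://bucket/sales/part-0": {
        "region": ["IN", "US", "IN", "EU"],
        "amount": [100, 250, 300, 50],
    }
}

def run_query(table, filter_region):
    meta = CATALOG[table]                      # 3. ask the catalog where the data is
    data = OBJECT_STORE[meta["location"]]      # 4. pull the data from object storage
    # 5. process at the compute layer and return the result to the client
    return sum(a for r, a in zip(data["region"], data["amount"]) if r == filter_region)

print(run_query("sales", "IN"))  # 400
```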

Modern OLAP has moved away from the traditional ETL workflow to an ELT workflow, where the transform step happens after load. dbt is one of the technologies that leverages ELT to transform data right after it has been loaded.

Examples:

|DB|What|Integration|
|Hive|DW system used with the Hive metastore; not ideal for the object storage that is part of modern OLAP|Integrates with the Hadoop ecosystem such as HDFS, Pig etc.|
|Apache Drill|Schema-less query engine for Hadoop, NoSQL and cloud storage|Integrates with the Hadoop ecosystem|
|DuckDB|In-process OLAP|Supports various file formats such as CSV, Parquet, JSON etc.|
|Druid|Real-time analytics DB for batch and streaming data|Streaming sources such as Kafka|
|Presto|SQL query engine by Facebook|Integrates with various file formats, OLAP DBs and visualisation tools|
|Snowflake|Written from the ground up; a whole platform for data engineering including hosting, querying, RBAC etc.|Integrates with everything|
|Pinot|Real-time distributed OLAP datastore from LinkedIn|Integrates with Kafka, S3, Tableau, Presto etc.|
|Redshift|Built on top of ParAccel|Integrates nicely with the AWS ecosystem|
|Apache Spark|Data analytics engine for high-volume data processing|Integrates with object stores etc.|
|Google BigQuery|Google's OLAP|Integrates with GCP|

ETL vs ELT and the DBT Approach

  1. ETL: The three phases (Extract, Transform, Load) as discussed in the 2000s section above.

  2. ELT: Extract and load the raw data first, then transform it inside the warehouse.

  3. dbt: A tool that implements ELT in the real world (a small ELT sketch follows this list).
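For contrast with the ETL sketch earlier, here is a minimal ELT sketch: raw rows are loaded first and the transformation runs inside the warehouse as SQL, which is roughly the kind of model dbt would version-control and run. SQLite again stands in for the warehouse and all names are made up.

```python
# Illustrative ELT: load raw data first, transform inside the warehouse.
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Load: dump the raw, untransformed rows straight into a staging table.
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", " ram ", "100"), ("2", "sam", ""), ("3", "aditi", "250")],
)

# Transform: a SQL model (the kind of thing dbt would version-control and run)
# cleans the staged data into an analytics-ready table.
conn.executescript("""
    DROP TABLE IF EXISTS orders_clean;
    CREATE TABLE orders_clean AS
    SELECT order_id,
           trim(customer)       AS customer,
           CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount <> '';
""")
conn.commit()
print(conn.execute("SELECT * FROM orders_clean").fetchall())
```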

Advantages of ELT vs ETL

  1. Version-controlled workflows: In ELT, the transformations live as code alongside the loaded data, so they can be put under version control and CI/CD workflows, which is common practice in software engineering and Agile teams.

  2. Faster feedback loop: As data is transformed on the fly, we get quicker feedback on insights from the data, instead of waiting for a cron job that dumps transformed data into the data warehouse.

Push vs Pull between Data & Query

There are a couple of ways to execute a query in the OLAP ecosystem. These are:

  1. Push Query to Data

  2. Pull Data to Query

1. Push Query to Data

In the earlier days, networking and data fetching were slow and expensive, so the idea was to send the query directly to the OLAP database system. The result of the executed query was then sent back to the client over the network.

2. Pull Data to Query

Modern OLAP uses a shared-disk architecture where the storage layer has no compute of its own, so queries cannot run there. At the same time, network and data-fetching speeds (thanks to SSDs) have improved. This makes it feasible to move the data to the query engine (as discussed earlier) for processing.

Modern OLAP combines 1 and 2: based on the type of workload and how frequently the data is accessed (available via query statistics), the system decides whether to bring the query to the data or the data to the query.
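A toy illustration of the trade-off: pushing the query means only the filtered result crosses the "network", while pulling means the whole block crosses it and filtering happens at the engine.

```python
# Toy comparison of "push query to data" vs "pull data to query".

STORAGE_BLOCK = list(range(1_000_000))  # pretend this lives on a storage node

def push_query_to_data(threshold):
    """Filter runs where the data lives; only matching rows cross the network."""
    shipped = [x for x in STORAGE_BLOCK if x > threshold]   # runs at storage
    return shipped                                           # small transfer

def pull_data_to_query(threshold):
    """The whole block crosses the network; filtering happens at the engine."""
    shipped = list(STORAGE_BLOCK)                            # large transfer
    return [x for x in shipped if x > threshold]             # runs at compute

# Same answer either way; the difference is how many bytes had to move.
assert push_query_to_data(999_990) == pull_data_to_query(999_990)
```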

Conclusion

I hope you learned why OLAP databases are needed and why every decade has brought a degree of change to the OLAP world.

The following are the references without which this article would not have been possible:

References

  1. Column-Stores vs. Row-Stores: How Different Are They Really?
