Redshift Columnar Storage explained
One the key features that make Amazon Redshift an amazing data warehousing solution is its storage model. In this short post we will see how Columnar storage makes storing and retrieving large amount of data more efficient and cost effective.
In order to store the data more efficiently Database Management Systems allocate logical database space for all data in a database. Data Blocks are low level logical unit which correspond to a specific number of bytes of physical database space on disk.
In an ideal situation your record sizes are the same as block sizes, this means that one record will be stored in one block.
Let’s say you have a database for your order processing. The database has a table called ORDER that stores information about teach order that is entered. Here is a sample of the records:
In a Row based storage data blocks store records sequentially. A logical view of how this data is stored is something like this:
This is an ideal situation for OLTP systems where you usually read or write all the columns of one or few records at a time. Assuming you have an index on the ORD_KEY column, finding the block that contains the data for order 100021 record is very fast and with minimal I/O you can update the record.
However if your goal is to aggregate the AMOUNT column and group it by let’s say PRD_KEY for the last ten years, 1 billion records, then the above storage model is not very efficient.
For this kind of queries and analysis Columnar storage is a much better choice, where the values of each column are stored in blocks. A logical view of the above records in Columnar storage is something like this:
As you can see Block 1 now contains values for the first column. This means that the number of records that can be stored in each block gets multiplied.
In the Row storage if you wanted to read the ORD_KEY column for the three records you needed three I/O operations since the values are scattered over three blocks. However in the Columnar storage you can read all three ORD_KEY values with one I/O operation.
Another benefit of Columnar storage is in the way you can use compression. Since the data type of each block is the same, compression can be applied much more efficiently.
This means the disk storage will be reduced which further reduces the I/O operations. In addition reduced disk space means you can store more records into the memory and increase the speed of aggregations, sorting and calculations.
Going back to our analysis example you should now see how Columnar storage can help you aggregate the AMOUNT column and group it by PRD_KEY much faster.