In this series, we make an attempt to chronicle our experience and best practices with redshift having used it in ‘anger’ in many projects. In this part 1 of the series, we look for appropriate schema design for redshift, the various alternatives and the pros and cons of each.
Proper dimensional model is an absolute need for Redshift to perform well. Three different dimensional models work best with Redshift:
To understand the differences between the three modeling schemes let’s get some terminology out of the way.
There are two types of tables that we create in a dimensional model Facts and Dimensions. The fact tables are the measures, while the dimension tables contain descriptive attributes. So, the facts usually have numbers and we use some numeric operations like sum, count, average on them. On the contrary, the dimensions are used to filter or categorize the data. The number of unique, in a dimension, would always be an order of magnitude smaller in comparison to the number of facts.
The difference between Star and Snowflake Schema manifest in the ‘normalization’, we do in the dimensions. In the Snowflake schema, the dimension tables are normalized iteratively whereas in Star there is only one level of normalization.
So, let’s visualize this with two tables Sales fact and Store dimension table. The Sales fact stores the amount of product sold from each store.
Here store and product are dimension tables while sales are the fact table.
Data in dimension tables do not change frequently and as mentioned above, contains the descriptive attributes. Here, store table stores the outlet information while the product is the collection of all types of products produced by the company.
Fact table sales, rather stores every transaction or sale of a product made by the individual store, so it is highly dynamic in nature and usually the heaviest one as well. Note all the columns in sales are of integer type, which makes it easy to do aggregations, indexing for queries to run faster.
Also, fact tables must be distributed across different redshift clusters by either stores or products to gain huge improvements in query performance.
Do you see a problem with Star Schema? There is one, dimension table “store” is not normalized. That means if there are 1000 stores in India, country ‘India’ would be repeated 1000 times. Same is the case with city and state. This makes dimension tables unnecessary bulkier.
While denormalizing extensively increases the number of joins to fetch the data, which adds to computing time and adds to query execution.
There exists another model i.e. flat model, which is essentially another level of denormalization over the star schema. The state, city, country and store all get folded into the fact table.
Here are the pros and cons of the three modeling choices.
### The three approaches compared
|Type||Normalization||Ease of Query||Storage size||Joins|
If you are using Redshift for data marts where query performance is critical the right choice is a snowflake schema. if you are using redshift as generic data warehouse it is preferred to build a star schema.
To know more about such cutting-edge technology, interesting use cases and deep insights to dwell upon and transform your business growth, reach out to us at email@example.com
Till then, keep innovating!
This blog has been written by Mr. Pronam Chatterjee.
As founder and CEO of BluePi Consulting, Pronam Chatterjee has been instrumental in driving the vision and strategy of the company. Pronam firmly believes that cutting-edge technology should be used to solve complex, real-world problems. He has an eye for spotting the big technological waves early and leveraging them, much before rest of the crowd wakes up to those.
A techie at heart, he started his career with the PWC, India, as a consultant on architecture and software development; and now has close to 20 years of experience working across companies like ACS, Sapient, Polaris, AON Hewitt, VMWare, etc. He was the CTO of Xerox Products in India, before founding BluePi in 2013.