In this series, we make an attempt to chronicle our experience and best practices with redshift having used it in ‘anger’ in many projects. In this part 1 of the series, we look for appropriate schema design for redshift, the various alternatives and the pros and cons of each.
Proper dimensional model is an absolute need for Redshift to perform well. Three different dimensional models work best with Redshift:
To understand the differences between the three modeling schemes let’s get some terminology out of the way. There are two types of tables that we create in a dimensional model Facts and Dimensions. The fact tables are the measures, while the dimension tables contain descriptive attributes. So, the facts usually have numbers and we use some numeric operations like sum, count, average on them. On the contrary, the dimensions are used to filter or categorize the data. The number of unique, in a dimension, would always be an order of magnitude smaller in comparison to the number of facts. The difference between Star and Snowflake Schema manifests in the ‘normalization’, we do in the dimensions. In the Snowflake schema, the dimension tables are d iteratively whereas in Star there is only one level of normalization. Examples So, let’s visualize this with two tables Sales fact and Store dimension table. The Sales fact stores the amount of product sold from each store. Star Schema
Here store and product are dimension tables while sales are the fact table. Data in dimension tables do not change frequently and as mentioned above, contains the descriptive attributes. Here, the store table stores the outlet information while the product is the collection of all types of products produced by the company. Fact table sales, rather stores every transaction or sale of a product made by the individual store, so it is highly dynamic in nature and usually the heaviest one as well. Note all the columns in sales are of integer type, which makes it easy to do aggregations, indexing for queries to run faster. Also, fact tables must be distributed across different redshift clusters by either stores or products to gain huge improvements in query performance. Snowflake Schema
Do you see a problem with Star Schema? There is one, dimension table “store” is not normalized. That means if there are 1000 stores in India, country ‘India’ would be repeated 1000 times. Same is the case with city and state. This makes dimension tables unnecessary bulkier. While denormalizing extensively increases the number of joins to fetch the data, which adds to computing time and adds to query execution. Flat Schema There exists another model i.e. flat model, which is essentially another level of denormalization over the star schema. The state, city, country and store all get folded into the fact table. Here are the pros and cons of the three modeling choices.
|Type||Normalization||Ease of Query||Storage size||Joins|
If you are using Redshift for data marts where query performance is critical the right choice is a snowflake schema. if you are using redshift as generic data warehouse it is preferred to build a star schema. To know more about such cutting-edge technology, interesting use cases and deep insights to dwell upon and transform your business growth, reach out to us. Till then, keep innovating!