How We Zeroed on the Best Data Warehouse

Zeroed on the Best Data Warehouse

The advancement of technology generally evokes a range of emotions in people from all walks of life. Some choose to see the cons, while others consider it a way to bring the world closer and solve major challenges.
But, as they say, any sufficiently advanced technology is equivalent to magic! Let us walk you through yet another blog post that amazes you with the goodies that technology can offer. A Quick Recap: piStats- Kong-The API Gateway  We discussed the pros and cons of using the ‘Kong API’ gateway and discussed the reasons that made us opt for it. ‘Kong’ as we discussed, is the first touchpoint and milestone of a user’s click-event journey - a journey that begins from the user’s browser or mobile, to ‘piStats Realtime’ dashboards. Going Forward: Big Data Management In this blog, we will discuss how to store and retain the enormous data, on a long-term; to derive our historical dashboards with user segmentation. We’ll talk about our data warehouse and the various alternatives that we tried before finally zeroing down to ‘Google BigQuery’. The data that we collect from the user’s click, needs to be stored – over a long term - for historical analysis and determining different user segments. This then feeds into the analytics dashboards or other campaign management systems that help us send push notifications, emails, etc. The data should be stored in such a way that it is easily accessible, without having to wait for long to get the desired results. Getting Started: Why Do We Need a Data warehouse? Apart from long-term storage, a data warehouse also serves as the operational data store to dump data away from the transactional storage, avoid load on the transactional database, and run reports & analyses on historical data. In a nutshell, it acts as a data repository of integrated data, from one or more sources.
Simple, isn’t it? Now, let’s see how we integrated a data warehouse with piStats. piStats and Data warehouse piStats uses data warehouse for two of its main features:

  • for the historical dashboards and</li>
  • for cohort analysis - used for reporting and push notifications.</li>

The historical dashboard has three different views based on granularity, i.e., Daily, Weekly, and Monthly. It also allows for date selection to get a report within desired range and granularity.
Users can also create cohorts using a user-friendly custom query engine to create desired segments.
Another important feature of piStats, i.e., Retention analysis, also relies on the data warehouse to compute user retention, every 6 hours. And finally, the data warehouse also helps piStats to analyse and fetch data for training the Machine Learning models, build intelligence, and perform predictive analysis, based on the historical data that it fetches. But, what’s so special? We required a data warehouse that could fulfill all the requirements of piStats with zero compromise on performance. In other words, it should support the limitless possibilities of piStats with an eye on:

  • Scaling and Integration: The data warehouse should be able to scale with the increasing size of the data set. For one of our clients, we added about 150 GB of data every month - we did not want to over provision, neither did we want to change scales frequently.</li>
  • Time Optimization: We wanted the query performance to be optimal so that the user firing the query from the custom query engine should not be kept waiting for long. By long, we mean, less than a minute!</li>
  • Cost Optimization: We wanted a pricing model that should charge for the storage and compute levels that we use and not for the default, packaged storage and compute levels that one normally gets with a data warehouse.</li>
  • Data Storage and Maintenance: Auto-purging was also one of the deciding factors. We wanted to limit data storage, enough to maintain last 3 months of data; rather than keeping the whole data-set persisting. This was a feature that came out-of-the-box in Google BigQuery.</li>

Why Not Redshift?

  • Storage and Compute Scaling: Redshift (Guide to Redshift Remodeling) provides fixed storage and compute nodes that needs to be scaled (up/down) as per requirement. Where scaling and re-partitioning is a good 4 to 5 hours of job, leaving the system available only for reads, and withholding writes.</li>
  • Pricing: Every node present in the cluster is priced even if it is not being utilized, or is idle; which, in our case is very relevant. Queries are fired at midnight mostly, and sparsely over the whole day but the load is present throughout 24 hours.</li>
  • Performance: Since our load was about 800 GB, and cost was a concern, we wanted to keep the nodes as little as possible. Even with partitioning in place, the queries took 4 to 5 minutes to give results, which was quite a bit for a user waiting to get results on the frontend. Increasing the number and size of nodes would have left the storage unused and pushed the costs up.
  • Data Deletion: We wanted to retain data only for 3 months in the data warehouse and required a mechanism to remove any data that is older than 3 months. Redshift required a ‘cron’ to be written for this purpose</li>

</ul> Why Not Impala on AWS EMR?

  • Storage and Compute Scaling: Impala was run on AWS EMR and hence required fixed memory and compute nodes to be added as per scaling requirements. If Impala was given nodes short on memory, it terminated the query with an Out of Memory error, hence being unreliable. The over-provisioned clusters left unused resources, adding to the cost. To avoid cost for the time when the Data warehouse was not used, we usually terminated the cluster. So, whenever a new cluster was provisioned a good 800 GB data was reloaded adding to the time of cluster provisioning.</li>
  • Pricing: In Impala, as well, one is charged for every node that is added into the cluster - be it on AWS EMR, or an independent Hadoop cluster.</li>
  • Version: Impala on EMR has no direct support - like other Hadoop ecosystem components namely Spark, Hive, etc. It needs to be installed using bootstrap actions which installs older versions of Impala, resulting in reduced features.</li>
  • Performance: Impala also took a good 10-11 minutes to return results for the queries and if under provisioned, it would go out of memory without executing the query – needless to say, that was quite frequent in our case!</li>

A Comparative Analysis So, before we knew it, in our attempt to find the best-fitting data warehouse for our requirements, we had tested 2 other alternatives along with BigQuery. Here’s a comparative chart summarizing the technical features of Redshift, Impala and BigQuery:

Metrics Redshift Impala(AWS EMR) BigQuery
Cost Charged per hour for each instance Charged per hour for each instance Charged for storage and data querying only.
Provisioning Chosen as per storage and memory requirement. Chosen as per storage and memory requirement. No notion of hardware, hence no provisioning.
Scaling Nodes can be added or removed Nodes can be added or removed Not required as no hardware.
Reliability Query executes till completion Query might give an out of memory error. Query executes till completion
Query Performance Limited by the number of CPUs and memory. Limited by the memory. Has no limitation and uses maximum resources available to get fast results
Maintainance Requires periodic vacuum operations None None
Partiotioning and Distribution(keys) Important for Query performance Important for Query performance Important to limit the amount of data queried. The query performance fast without partioning as well.
Streaming Data Not Possible Requires table to be refreshed. Allows streaming inserts using APIs
Versions None Limited to 2.2 with EMR None
Bulk Upload 90GB in 4-5 hours 90GB in 1-2 hours with a large Core node 90GB in 4-5 hours
Join Queries Slow Limited with the versions available Fast
Count Distinct Slow Allowed but extremely slow. Used NDV instead Fast
TimeStamp Datatype Supports different formats Limited format support Supports different formats
Deleting specific rows Allowed Not supported Only partitions can expire. Delete not supported.
Update Operation Allowed but slow Not supported Not supported

Alright then, do you now know why we chose BigQuery over the others? Have different opinions? Send it our way by commenting below.In our next blogpost we’ll validate our choice, discuss the ground realities and talk about our experiments with BigQuery. As they say, it is supposed to be automatic, but you must push the button, or scroll and read in this case. Till then, keep innovating and experimenting!

Anjna Bhati
Written by
Anjna Bhati

Anjna Bhati is Head-Data Analytics & AI in BluePi Consulting with Expertise in DWBI, analytics, big data, machine learning solutions on Cloud (AWS, Google, Azure) clickstream analytics, real time data pipelines and reporting.