Finding BigQuery! How We Zeroed In on the Best Data Warehouse for piStats
The advancement of technology generally evokes a range of emotions in people from all walks of life. Some choose to see the cons, while others consider it a way to bring the world closer and solve major challenges. But, as they say, any sufficiently advanced technology is equivalent to magic! Let us walk you through yet another blog post that amazes you with the goodies that technology can offer.A Quick Recap: piStats- Kong-The API Gateway We discussed the pros and cons of using the ‘Kong API’ gateway and discussed the reasons that made us opt for it. ‘Kong’ as we discussed, is the first touchpoint and milestone of a user’s click-event journey - a journey that begins from the user’s browser or mobile, to ‘piStats Realtime’ dashboards.Going Forward: Big Data ManagementIn this blog, we will discuss how to store and retain the enormous data, on a long term; to derive our historical dashboards with user segmentation. We’ll talk about our data warehouse and the various alternatives that we tried before finally zeroing down to ‘Google BigQuery’. The data that we collect from the user’s click, needs to be stored – over a long term - for historical analysis and determining different user segments. This then feeds into the analytics dashboards or other campaign management systems that help us send push notifications, emails, etc. The data should be stored in such a way that it is easily accessible, without having to wait for long to get the desired results.Getting Started: Why Do We Need a Data warehouse?Apart from long term storage, a data warehouse also serves as the operational data store to dump data away from the transactional storage, avoid load on the transactional database, and run reports & analyses on historical data.In a nutshell, it acts as a data repository of integrated data, from one or more sources.Simple, isn’t it? Now, let’s see how we integrated a data warehouse with piStats.piStats and Data warehousepiStats uses data warehouse for two of its main features:
- for the historical dashboards and
- for cohort analysis - used for reporting and push notifications.
- Scaling and Integration: The data warehouse should be able to scale with the increasing size of the data set. For one of our clients, we added about 150 GB of data every month - we did not want to over provision, neither did we want to change scales frequently.
- Time Optimization: We wanted the query performance to be optimal so that the user firing the query from the custom query engine should not be kept waiting for long. By long, we mean, less than a minute!
- Cost Optimization: We wanted a pricing model that should charge for the storage and compute levels that we use and not for the default, packaged storage and compute levels that one normally gets with a data warehouse.
- Data Storage and Maintenance: Auto-purging was also one of the deciding factors. We wanted to limit data storage, enough to maintain last 3 months of data; rather than keeping the whole data-set persisting. This was a feature that came out-of-the-box in Google BigQuery.
- Storage and Compute Scaling: Redshift provides fixed storage and compute nodes that needs to be scaled (up/down) as per requirement. Where scaling and re-partitioning is a good 4 to 5 hours of job, leaving the system available only for reads, and withholding writes.
- Pricing: Every node present in the cluster is priced even if it is not being utilized, or is idle; which, in our case is very relevant. Queries are fired at midnight mostly, and sparsely over the whole day but the load is present throughout 24 hours.
- Performance: Since our load was about 800 GB, and cost was a concern, we wanted to keep the nodes as little as possible. Even with partitioning in place, the queries took 4 to 5 minutes to give results, which was quite a bit for a user waiting to get results on the frontend. Increasing the number and size of nodes would have left the storage unused and pushed the costs up.
- Data Deletion: We wanted to retain data only for 3 months in the data warehouse and required a mechanism to remove any data that is older than 3 months. Redshift required a ‘cron’ to be written for this purpose
- Storage and Compute Scaling: Impala was run on AWS EMR and hence required fixed memory and compute nodes to be added as per scaling requirements. If Impala was given nodes short on memory, it terminated the query with an Out of Memory error, hence being unreliable. The over-provisioned clusters left unused resources, adding to the cost. To avoid cost for the time when the Data warehouse was not used, we usually terminated the cluster. So, whenever a new cluster was provisioned a good 800 GB data was reloaded adding to the time of cluster provisioning.
- Pricing: In Impala, as well, one is charged for every node that is added into the cluster - be it on AWS EMR, or an independent Hadoop cluster.
- Version: Impala on EMR has no direct support - like other Hadoop ecosystem components namely Spark, Hive, etc. It needs to be installed using bootstrap actions which installs older versions of Impala, resulting in reduced features.
- Performance: Impala also took a good 10-11 minutes to return results for the queries and if under provisioned, it would go out of memory without executing the query – needless to say, that was quite frequent in our case!
|Cost||Charged per hour for each instance||Charged per hour for each instance||Charged for storage and data querying only.|
|Provisioning||Chosen as per storage and memory requirement.||Chosen as per storage and memory requirement.||No notion of hardware, hence no provisioning.|
|Scaling||Nodes can be added or removed||Nodes can be added or removed||Not required as no hardware.|
|Reliability||Query executes till completion||Query might give an out of memory error.||Query executes till completion|
|Query Performance||Limited by the number of CPUs and memory.||Limited by the memory.||Has no limitation and uses maximum resources available to get fast results|
|Maintainance||Requires periodic vacuum operations||None||None|
|Partiotioning and Distribution(keys)||Important for Query performance||Important for Query performance||Important to limit the amount of data queried. The query performance fast without partioning as well.|
|Streaming Data||Not Possible||Requires table to be refreshed.||Allows streaming inserts using APIs|
|Versions||None||Limited to 2.2 with EMR||None|
|Bulk Upload||90GB in 4-5 hours||90GB in 1-2 hours with a large Core node||90GB in 4-5 hours|
|Join Queries||Slow||Limited with the versions available||Fast|
|Count Distinct||Slow||Allowed but extremely slow. Used NDV instead||Fast|
|TimeStamp Datatype||Supports different formats||Limited format support||Supports different formats|
|Deleting specific rows||Allowed||Not supported||Only partitions can expire. Delete not supported.|
|Update Operation||Allowed but slow||Not supported||Not supported|