The advancement of technology generally evokes a range of emotions in people from all walks of life. Some choose to see the cons, while others consider it a way to bring the world closer and solve major challenges.
But, as they say, any sufficiently advanced technology is equivalent to magic! Let us walk you through yet another blog post that amazes you with the goodies that technology can offer.
A Quick Recap: piStats- Kong-The API Gateway
We discussed the pros and cons of using the ‘Kong API’ gateway and discussed the reasons that made us opt for it. ‘Kong’ as we discussed, is the first touchpoint and milestone of a user’s click-event journey - a journey that begins from the user’s browser or mobile, to ‘piStats Realtime’ dashboards.
Going Forward: Big Data Management
In this blog, we will discuss how to store and retain the enormous data, on a long-term; to derive our historical dashboards with user segmentation. We’ll talk about our data warehouse and the various alternatives that we tried before finally zeroing down to ‘Google BigQuery’.
The data that we collect from the user’s click, needs to be stored – over a long term - for historical analysis and determining different user segments. This then feeds into the analytics dashboards or other campaign management systems that help us send push notifications, emails, etc. The data should be stored in such a way that it is easily accessible, without having to wait for long to get the desired results.
Getting Started: Why Do We Need a Data warehouse?
Apart from long-term storage, a data warehouse also serves as the operational data store to dump data away from the transactional storage, avoid load on the transactional database, and run reports & analyses on historical data.
In a nutshell, it acts as a data repository of integrated data, from one or more sources.
Simple, isn’t it? Now, let’s see how we integrated a data warehouse with piStats.
piStats and Data warehouse
piStats uses data warehouse for two of its main features:
The historical dashboard has three different views based on granularity, i.e., Daily, Weekly, and Monthly. It also allows for date selection to get a report within desired range and granularity.
Users can also create cohorts using a user-friendly custom query engine to create desired segments.
Another important feature of piStats, i.e., Retention analysis, also relies on the data warehouse to compute user retention, every 6 hours.
And finally, the data warehouse also helps piStats to analyse and fetch data for training the Machine Learning models, build intelligence, and perform predictive analysis, based on the historical data that it fetches.
But, what’s so special?
We required a data warehouse that could fulfill all the requirements of piStats with zero compromise on performance. In other words, it should support the limitless possibilities of piStats with an eye on:
Why Not Redshift?
Why Not Impala on AWS EMR?
A Comparative Analysis
So, before we knew it, in our attempt to find the best-fitting data warehouse for our requirements, we had tested 2 other alternatives along with BigQuery. Here’s a comparative chart summarizing the technical features of Redshift, Impala and BigQuery:
|Cost||Charged per hour for each instance||Charged per hour for each instance||Charged for storage and data querying only.|
|Provisioning||Chosen as per storage and memory requirement.||Chosen as per storage and memory requirement.||No notion of hardware, hence no provisioning.|
|Scaling||Nodes can be added or removed||Nodes can be added or removed||Not required as no hardware.|
|Reliability||Query executes till completion||Query might give an out of memory error.||Query executes till completion|
|Query Performance||Limited by the number of CPUs and memory.||Limited by the memory.||Has no limitation and uses maximum resources available to get fast results|
|Maintainance||Requires periodic vacuum operations||None||None|
|Partiotioning and Distribution(keys)||Important for Query performance||Important for Query performance||Important to limit the amount of data queried. The query performance fast without partioning as well.|
|Streaming Data||Not Possible||Requires table to be refreshed.||Allows streaming inserts using APIs|
|Versions||None||Limited to 2.2 with EMR||None|
|Bulk Upload||90GB in 4-5 hours||90GB in 1-2 hours with a large Core node||90GB in 4-5 hours|
|Join Queries||Slow||Limited with the versions available||Fast|
|Count Distinct||Slow||Allowed but extremely slow. Used NDV instead||Fast|
|TimeStamp Datatype||Supports different formats||Limited format support||Supports different formats|
|Deleting specific rows||Allowed||Not supported||Only partitions can expire. Delete not supported.|
|Update Operation||Allowed but slow||Not supported||Not supported|
Alright then, do you now know why we chose BigQuery over the others? Have different opinions? Send it our way by commenting below.In our next blogpost we’ll validate our choice, discuss the ground realities and talk about our experiments with BigQuery.
As they say, it is supposed to be automatic, but you must push the button, or scroll and read in this case. Till then, keep innovating and experimenting!
Anjna Bhati is Software Engineer with BluePi Consulting and right now working on feature enhancement of piStats.