It’s easy to fall prey to the hype surrounding big data and dive deep into that ocean without laying the right kind of foundation. The foundation determines what kind of environment you build, and therefore how flexible and scalable your big data capability actually is. And that’s where BluePi can help. With our expertise in building big data architecture for companies of repute, we can help you get the most out of your data without having to spend a bomb!
What is Big Data?
OK, first things first: Big Data is, quite simply, data that runs into the petabyte scale. Working at such a humongous scale is no mean task, but thanks to pioneering work by Google and the open-source Apache Hadoop project, we now have file systems that are built to handle it. And it’s only getting better by the day!
What’s so special about Big Data Infrastructure?
- Petabyte scale, of course, with the ability to scale up and down
- Distributed file systems to store data, as opposed to the centralized storage of a traditional SQL database
- Structured, unstructured, text, videos, images. You name it, you can save it.
- You need not worry about the schema or the relationships between fields while storing the data. You can write programs that take care of that later, at query time.
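That last point is often called schema-on-read. Here is a minimal Python sketch of the idea, not a production recipe: the raw records, the field names (`user`, `amount`) and the helper names are all made up for illustration. Everything is stored as it arrives, and the structure is only imposed when you query.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"user": "asha", "amount": "120.50", "city": "Pune"}',
    '{"user": "ravi", "amount": 80}',
    'not even json',
    '{"user": "meera", "amount": "n/a", "device": "mobile"}',
]


def read_with_schema(lines):
    """Apply a schema at query time: yield (user, amount) pairs that parse cleanly."""
    for line in lines:
        try:
            record = json.loads(line)
            yield str(record["user"]), float(record["amount"])
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete records are skipped at read time,
            # rather than being rejected at write time.
            continue


def total_by_user(lines):
    """A simple 'query': total amount per user over whatever records fit the schema."""
    totals = {}
    for user, amount in read_with_schema(lines):
        totals[user] = totals.get(user, 0.0) + amount
    return totals


if __name__ == "__main__":
    print(total_by_user(RAW_LINES))
```

Notice that the messy records cost nothing at storage time. The price is paid later, in the query code that has to decide what to do with them.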
There’s a caveat though, as with all good things in this world. Without the right architecture in place, the benefits of big data are seldom realized. And then there is the challenge of wading through a huge, complex ecosystem comprising multiple tools, techniques and frameworks. Deciding which ones to adopt, and then building a robust stack on them, is no easy task. So, before we go any further, let’s quickly look at some of the core components of big data architecture:
Storage, Of Course
At the core of big data is the storage system. The Hadoop Distributed File System (HDFS) is one of the most commonly used storage frameworks, and NoSQL databases like MongoDB and HBase are common as well. You may also have some relational databases like MySQL in the mix. Quite simply, HDFS lets data of all formats be stored in a single repository, while distributing it across thousands of servers and making it available for processing as and when required. The diversity of data you can store, combined with the speed of processing, sets this system apart from traditional centralized storage. It has, quite literally, made big data possible!
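To make the "distribute it across servers" part concrete, here is a rough Python simulation of how a file might be cut into blocks and replicated. The 128 MB block size and replication factor of 3 are the HDFS defaults in Hadoop 2 and later. The round-robin placement, the node names and the `place_blocks` helper are simplifying assumptions, since real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size in Hadoop 2 and later
REPLICATION = 3       # HDFS default replication factor


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Real HDFS placement is rack-aware; round-robin is an assumption made
    here purely to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Consecutive offsets guarantee each replica lands on a different node.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


if __name__ == "__main__":
    layout = place_blocks(file_size_mb=500, nodes=["node1", "node2", "node3", "node4"])
    for block, replicas in layout.items():
        print(f"block {block}: {replicas}")
```

With every block living on three different nodes, losing a single server does not lose any data, and processing can be sent to whichever node already holds a copy.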
The Hadoop Ecosystem
Sitting on top of the distributed file system is the cluster management and distributed processing layer. Here, a framework like MapReduce schedules work on the nodes that hold copies of the data, and labels and maps that data so that you can run programs to process and retrieve the right set as and when required. You can even add tools like Solr, which work alongside this layer to index the stored data and its meta information. To interact with the stored data and the MapReduce layer, you have two options: use SQL-like querying tools like Hive, or use scripting tools like Pig. The results then flow either into a system of data warehouses and data marts, or into a data lake.
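If MapReduce sounds abstract, the classic word count shows the pattern. The sketch below runs the map, shuffle and reduce phases in a single Python process. It illustrates the programming model only, not the Hadoop API, and the function names and sample input are invented for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map over every split, then shuffle and reduce the combined output."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    splits = ["big data is big", "data lakes store data"]
    print(word_count(splits))
```

On a real cluster, each split would be mapped on the node that stores it, and the shuffle would move the grouped pairs across the network to the reducers. Tools like Hive and Pig generate jobs of this shape for you, so you rarely have to write them by hand.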
There’s an important distinction between a data warehouse and a data lake. While the former is almost entirely used to store structured data, a data lake stores all data as is, and therefore contains a mix of structured, semi-structured and unstructured data. This is especially useful for data scientists, and can prove to be an inexpensive way to store data compared with using multiple data warehouses, especially while scaling up! However, data from a data lake needs a fair amount of post-processing, and can therefore be less useful for non-technical decision makers.
The data stored within data warehouses and data lakes can then be fed to big data analytics tools to carry out predictive analytics, real-time analytics or to make recommendations. Check out all the big data jugglery that BluePi does to churn out actionable business insights.
Check out our recent blog post, where we explore the available data warehouse options for the analytics and recommendation module of piStats, our custom product suite for media & entertainment businesses.
Given these multiple components and their uses, it can get quite tricky to arrive at the right ecosystem for your business needs. Thankfully, to make our jobs easier, there are packaged distributions of the Hadoop ecosystem that let us choose from pre-built configurations. You could go with the Hortonworks Data Platform (HDP) or Cloudera’s Distribution including Apache Hadoop (CDH), or get a custom stack designed from scratch. BluePi can help you with all three options.
Hadoop on Cloud?
Although an on-premise Hadoop setup can give you an increased sense of security and control over all your data, the cost-effective, auto-scaling options of a cloud-based setup could just turn the tide! BluePi can help you here as well, whether you want to leverage the power of AWS with Amazon Redshift, Google Cloud with BigQuery and a whole lot more; go with the hybrid, smart cloud options provided by Microsoft Azure, powered by HDInsight; or even go the IBM way and utilize the prowess of Bluemix and Watson.
So, the next time you think of Big Data, do get in touch with us. We'll discuss the right solution, down to the last detail, over a cup of coffee!