You can’t have a conversation about Big Data nowadays without talking about Hadoop. This open-source software platform has become synonymous with Big Data. But what exactly is Hadoop?
Hadoop in a nutshell
Hadoop is basically a solution for storing and analyzing enormous amounts of data across a distributed cluster of commodity servers. It is designed and implemented to be robust, modular, sustainable, efficient and above all FAST!
What about the Apache Software Foundation?
The Apache Software Foundation is a decentralized community of developers that provides support for the Apache family of open-source software projects. The official definition of Hadoop is as follows:
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
What are the main components of Hadoop?
Hadoop has been designed to be modular and consists of numerous modules, each with its own task. The most essential parts of Hadoop are the data processing framework and the distributed file system used for storage.
The distributed file system is the component that actually holds the “Big Data”. By default Hadoop uses the Hadoop Distributed File System (HDFS), although other (proprietary) file systems might work as well (e.g. the Apache Cassandra File System or Amazon S3). HDFS acts like the database of the Hadoop cluster (although it is not really a database), containing all the data to be retrieved and/or analyzed.
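To get a feel for how HDFS lays data out, here is a toy Python sketch (not real HDFS code; node names, the round-robin placement and the 128 MB block size are illustrative assumptions, and real HDFS uses rack-aware placement):

```python
def place_blocks(file_size, block_size=128 * 1024 * 1024,
                 data_nodes=("node1", "node2", "node3", "node4"),
                 replication=3):
    """Toy illustration of HDFS storage: split a file into fixed-size
    blocks and assign each block to `replication` distinct DataNodes.
    Placement here is simple round-robin; real HDFS is rack-aware."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [data_nodes[(block_id + r) % len(data_nodes)]
                               for r in range(replication)]
    return placement

# A 300 MB file needs three 128 MB blocks, each stored on 3 different nodes.
plan = place_blocks(300 * 1024 * 1024)
print(len(plan))     # 3 blocks
print(plan[0])       # ['node1', 'node2', 'node3']
```

Losing one node therefore never loses data, because every block still exists on two other nodes.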
The data processing framework is the software used to work with the data, or in other words the technology exposing HDFS. By default Hadoop uses MapReduce. MapReduce is a programming model introduced by Google for executing massive amounts of computation within a short period of time; Hadoop ships with an open-source implementation of it. It forms the heart of Hadoop in the sense that it actually processes the data.
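The model is easier to see in miniature. Below is the classic word-count example sketched in plain Python rather than Hadoop's Java API: the map phase emits a (key, value) pair per word, and the reduce phase sums the values per key (the shuffle/sort between them is simulated by `sorted`):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per word (input must be sorted by key,
    which is what the shuffle phase guarantees in a real cluster)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

pairs = map_phase("the quick brown fox jumps over the lazy dog the end")
counts = dict(reduce_phase(pairs))
print(counts["the"])  # 3
```

On a real cluster the map and reduce functions run in parallel on many nodes; the logic, however, is exactly this.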
How to query Hadoop?
Since Hadoop, or its underlying HDFS, is not a “normal” relational database, we cannot simply fire a SQL query to retrieve the data we are looking for. Instead we need a framework like MapReduce to process, search and retrieve it. Although MapReduce is known for its power and flexibility, it also brings additional complexity into the picture. Luckily, if you are not a Java guru, there are applications out there to make your life much easier. One of them is Hive, which is included in the Hadoop installation. This (other) Apache project converts SQL-like queries (written in HiveQL) into MapReduce code. Alternatively, we could use third-party proprietary plugins like the TIBCO ActiveMatrix BusinessWorks Plug-in for Big Data, which facilitates connectivity between your Enterprise Service Bus (ESB) and Hadoop by generating MapReduce code from SQL queries. I already spoke briefly about this specific plugin in one of my previous blogs.
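To make that translation concrete: a HiveQL aggregation such as `SELECT department, COUNT(*) FROM employees GROUP BY department` boils down to a map phase emitting (department, 1) pairs and a reduce phase summing them. A toy Python equivalent, with the shuffle and reduce collapsed into one dictionary for brevity (the `employees` rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table.
employees = [
    {"name": "Ada", "department": "IT"},
    {"name": "Bob", "department": "HR"},
    {"name": "Eve", "department": "IT"},
]

# SELECT department, COUNT(*) FROM employees GROUP BY department
# Map: emit (department, 1).  Shuffle + reduce: sum per key.
counts = defaultdict(int)
for row in employees:              # map side
    counts[row["department"]] += 1 # reduce side (collapsed here)

print(dict(counts))  # {'IT': 2, 'HR': 1}
```

Hive spares you from writing this plumbing by hand in Java for every query.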
The Hadoop cluster
The typical components of a distributed Hadoop cluster are the JobTracker and one or more TaskTrackers. The JobTracker is the service running on the master node; it receives requests from the client and routes them to the specific TaskTracker (node) that ideally holds the data the client is looking for. This information is retrieved from the NameNode, which knows the location of the data. If for some reason the assigned primary TaskTracker is unreachable, the JobTracker assigns the task to another TaskTracker holding a replica of the data.
The TaskTracker, then, is a node that accepts tasks (map, reduce or shuffle operations) from a JobTracker. Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept concurrently. The TaskTracker spawns a separate JVM process to do the actual work, so that a process failure does not take down the TaskTracker itself. Via a heartbeat mechanism the TaskTracker notifies the JobTracker of its status and the empty slots it has available, so the JobTracker always knows where in the cluster work can be delegated.
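The scheduling idea can be sketched in a few lines. This is a deliberately simplified Python model (the class and tracker names are invented, and real heartbeats are periodic RPC messages, not a field): the JobTracker prefers the data-local TaskTracker and falls back to any tracker with a free slot.

```python
class TaskTracker:
    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots  # in reality reported via heartbeat

class JobTracker:
    """Toy scheduler: place each task on a TaskTracker with a free slot,
    preferring the tracker that holds the data (data locality)."""
    def __init__(self, trackers):
        self.trackers = {t.name: t for t in trackers}

    def assign(self, task, preferred):
        # Try the data-local tracker first, then any other with capacity.
        candidates = [preferred] + [n for n in self.trackers if n != preferred]
        for name in candidates:
            tracker = self.trackers.get(name)
            if tracker and tracker.free_slots > 0:
                tracker.free_slots -= 1
                return name
        return None  # no capacity anywhere: the task waits

jt = JobTracker([TaskTracker("tt1", slots=1), TaskTracker("tt2", slots=2)])
print(jt.assign("map-0", preferred="tt1"))  # tt1 (data-local)
print(jt.assign("map-1", preferred="tt1"))  # tt2 (tt1 is full, fail over)
```

This mirrors the failover described above: when the preferred node has no capacity (or is unreachable), the work moves to a node holding a replica.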
The Hadoop cluster has another big advantage: since the data (DataNode) and the data processing (TaskTracker) reside on the same servers, it is very easy to expand storage capacity and/or processing power by adding another server to the cluster.
Who uses Hadoop?
A wide variety of companies and organizations use Hadoop for both research and production. For an overview of existing users, please have a look at the following link:
Although the complexity of an information technology landscape rises immediately after introducing a Hadoop cluster into an enterprise (think of MapReduce and the fact that you won’t be able to simply fire SQL queries), Hadoop is still the most widely used framework for processing enormous amounts of data quickly without having to spend enormous amounts of money storing it in a relational database. Applications like Apache Hive will definitely help to overcome the “MapReduce” wall, while third-party vendors are already offering proprietary plugins for Hadoop.