This blog post aims to make the Big Data concept more concrete by describing the technology mechanisms that form the primary, common components of Big Data solutions, regardless of the open-source or vendor products used to implement them.
I am currently following the Big Data Science program from Arcitura™, which explores the concept from a vendor-neutral angle. The program covers various Big Data mechanisms. In this post I summarize these mechanisms and how they act as the moving parts within Big Data solutions to provide the required features and functions. For each mechanism I also give a specific example of a product that implements it, taken from the widely adopted Apache Hadoop ecosystem.
The storage device mechanism provides the underlying data storage for persisting Big Data datasets. It can exist as either a distributed file system or a NoSQL database.
An example is the Hadoop Distributed File System (HDFS), an on-disk storage mechanism based on a distributed file system.
The processing engine mechanism is responsible for the actual processing of the data (executing the processing job). It uses a distributed parallel programming framework that enables it to process very large amounts of data distributed across multiple nodes.
An example is MapReduce, which is a batch processing engine mechanism implemented by Hadoop that uses parallel processing deployed over commodity hardware clusters to process massive datasets.
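To make the map and reduce steps tangible, here is a minimal single-machine sketch of the classic word count in plain Python. It is not Hadoop's API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are my own, and the calls that a real cluster would spread over many nodes simply run one after another here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a cluster each line (or block of lines) would be mapped on a
    # different node; in this sketch the map calls run sequentially.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample_lines = ["big data big clusters", "data moves to clusters"]
counts = word_count(sample_lines)
print(counts)
```

Because every map call only looks at its own line and every reduce call only looks at its own key, both phases can be run in parallel, which is what makes the model a good fit for commodity clusters.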
The resource manager acts as a scheduler that schedules and prioritizes processing requests by managing and allocating available resources (processing engines).
An example is YARN (Yet Another Resource Negotiator), which is Hadoop’s implementation of the resource manager mechanism and is also known as MapReduce 2.0. The main component of YARN is the ResourceManager, the ultimate global authority that allocates the required resources.
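The idea of scheduling and prioritizing requests against a limited pool of resources can be sketched with a priority queue. This is a toy illustration only: `ToyResourceManager`, the notion of "slots" and the job names are assumptions of mine and do not reflect how YARN's schedulers actually work.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    priority: int                        # lower number = more urgent
    seq: int                             # tie-breaker: submission order
    job: str = field(compare=False)
    slots: int = field(compare=False)    # processing slots the job needs


class ToyResourceManager:
    """Hypothetical scheduler: grants free slots to the most urgent requests."""

    def __init__(self, total_slots):
        self.free_slots = total_slots
        self._queue = []
        self._seq = 0

    def submit(self, job, slots, priority):
        heapq.heappush(self._queue, Request(priority, self._seq, job, slots))
        self._seq += 1

    def schedule(self):
        """Walk the queue by priority and grant every request that still fits."""
        granted, waiting = [], []
        while self._queue:
            request = heapq.heappop(self._queue)
            if request.slots <= self.free_slots:
                self.free_slots -= request.slots
                granted.append(request.job)
            else:
                waiting.append(request)
        for request in waiting:          # requests that did not fit stay queued
            heapq.heappush(self._queue, request)
        return granted

    def release(self, slots):
        """Return slots to the pool once a job has finished."""
        self.free_slots += slots


manager = ToyResourceManager(total_slots=4)
manager.submit("nightly-report", slots=3, priority=5)
manager.submit("fraud-alert", slots=2, priority=1)
manager.submit("log-cleanup", slots=2, priority=9)
print(manager.schedule())  # ['fraud-alert', 'log-cleanup']
```

Note that the sketch lets a small low-priority job run when a larger, more urgent one does not fit. Whether that is desirable is a policy decision, and real resource managers make such policies configurable.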
Data Transfer Engine
A data transfer engine enables data to be moved in (ingress) or out (egress) of Big Data solutions. This data can reside on a distributed file system or within a database (NoSQL).
An example is Flume, which is Hadoop’s data transfer engine mechanism. Flume is a distributed, highly reliable and fault-tolerant system used for collecting, aggregating and distributing large amounts of line-based textual data and binary data from multiple data sources into a single storage device such as the Hadoop Distributed File System (HDFS). Alternatively, we can use Sqoop in Hadoop for moving structured relational data between a relational database and HDFS.
The processing engine enables data to be queried and manipulated, but this functionality requires custom programming against the MapReduce framework. To simplify things, the query engine mechanism abstracts the processing engine away from end users by providing a front-end interface for querying the underlying data in a familiar language that is easier to work with, such as SQL. The query engine is used for basic processing functions such as sum, average, group by, join and sort.
An example is Hive, which is a query engine mechanism as implemented by Hadoop that provides the ability to query data stored in the Hadoop Distributed File System (HDFS) using a SQL dialect called HiveQL. Under the hood, Hive makes use of the MapReduce framework for running HiveQL queries and acts as a data warehouse on top of Hadoop.
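To show what "sum, average, group by and sort" look like when expressed declaratively, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a query engine. SQLite is not Hive and nothing here runs on Hadoop; the `sales` table and its values are made up. The point is only that a few lines of SQL replace the custom map and reduce code you would otherwise write.

```python
import sqlite3

# In-memory database standing in for data that would live in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 50.0), ("north", 300.0), ("south", 150.0)],
)

# One declarative statement covers grouping, aggregation and sorting.
query = """
    SELECT region, SUM(amount) AS total, AVG(amount) AS average
    FROM sales
    GROUP BY region
    ORDER BY region
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, total, average in rows:
    print(region, total, average)
```

A query engine such as Hive takes a statement like this and turns it into processing jobs for you, so the end user never has to touch the processing engine directly.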
The analytics engine mechanism can run advanced statistical and machine learning algorithms to identify patterns and correlations. It is used when the comparatively simple data functions of a query engine are insufficient.
An example is Mahout, which is the analytics engine mechanism in Hadoop and provides the implementation of various scalable machine learning algorithms that can be run using a MapReduce processing engine.
The workflow engine mechanism provides the ability to design and process a complex sequence of operations which can include the participation of other Big Data mechanisms.
An example is Oozie, which is the workflow engine mechanism in Hadoop. It is used for managing data processing jobs and integrates with the MapReduce processing engine, the Hive query engine and the Sqoop data transfer engine.
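At its core, a workflow engine runs a set of steps in an order that respects their dependencies. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+). It is not how Oozie is used in practice, since Oozie workflows are defined in XML; the step names and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key is a step, each value is the set of steps
# that must finish before it may start.
workflow = {
    "sqoop_import_orders": set(),
    "sqoop_import_customers": set(),
    "mapreduce_clean": {"sqoop_import_orders", "sqoop_import_customers"},
    "hive_aggregate": {"mapreduce_clean"},
    "sqoop_export_report": {"hive_aggregate"},
}


def run_workflow(graph, run_step):
    """Run every step after all of its prerequisites; return the order used."""
    order = []
    for step in TopologicalSorter(graph).static_order():
        run_step(step)
        order.append(step)
    return order


executed = run_workflow(workflow, run_step=lambda step: print("running", step))
```

The two import steps have no dependency on each other, so their relative order is not fixed. A real workflow engine could run such independent steps in parallel.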
Last but not least is the coordination engine, which is used to ensure operational consistency across all of the participating servers. Often used by the processing engine mechanism, the coordination engine coordinates data processing across a large number of servers by supporting distributed locks, distributed queues and providing a highly available registry for obtaining configuration information.
An example is ZooKeeper, which is a coordination engine mechanism implemented by Hadoop whose main responsibility is coordinating messages between the nodes of a cluster.
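To illustrate the two services mentioned above, locks and a configuration registry, here is a deliberately simplified single-process sketch. It does not use ZooKeeper or its client API, and `ToyCoordinator` and the node names are inventions of mine. A real coordination engine offers the same kind of guarantees across many servers and keeps working when some of them fail.

```python
class ToyCoordinator:
    """Hypothetical in-memory stand-in for a coordination service."""

    def __init__(self):
        self._locks = {}     # lock name -> current owner
        self._config = {}    # registry of shared configuration values

    def acquire(self, name, owner):
        """Return True if `owner` holds the lock after this call."""
        holder = self._locks.setdefault(name, owner)
        return holder == owner

    def release(self, name, owner):
        """Release the lock, but only if `owner` is the one holding it."""
        if self._locks.get(name) == owner:
            del self._locks[name]
            return True
        return False

    def set_config(self, key, value):
        self._config[key] = value

    def get_config(self, key, default=None):
        return self._config.get(key, default)


coordinator = ToyCoordinator()
coordinator.set_config("batch.size", 500)

print(coordinator.acquire("partition-7", "node-a"))  # True: lock was free
print(coordinator.acquire("partition-7", "node-b"))  # False: node-a holds it
coordinator.release("partition-7", "node-a")
print(coordinator.acquire("partition-7", "node-b"))  # True: lock is free again
```

The lock ensures that only one node works on "partition-7" at a time, while the registry gives every node the same view of the configuration.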