Big Data: the buzzword for the massive amounts of data generated by our increasingly digitised lives. Companies around the world are busy developing more efficient methods for compiling electronic data on a massive scale, saving it on enormous storage architectures, and processing it systematically. Data volumes measured in petabytes or exabytes are no longer a rarity these days. But no single system is able to effectively process data of this magnitude. Big Data analyses therefore require software platforms that make it possible to distribute complex computing tasks across a large number of computer nodes. One prominent solution is Apache Hadoop, a framework that serves as the basis for several distributions and Big Data suites.


What is Hadoop?

Apache Hadoop is a Java-based framework comprising a diverse range of software components that make it possible to split computing tasks (jobs) into subtasks. These are distributed across the nodes of a computer cluster, where they run simultaneously. Large Hadoop architectures use thousands of individual computers. This concept has the advantage that each computer in the cluster only has to provide a fraction of the required hardware resources. Working with large quantities of data therefore does not necessarily require high-end machines; it can instead be carried out on a number of cost-effective commodity servers.
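
The divide-and-distribute idea can be illustrated outside Hadoop with a few lines of plain Python (a toy sketch of the principle, not Hadoop code): a job is cut into chunks, the chunks are processed concurrently by workers, and the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # one "subtask": count the words in one slice of the input
    return sum(len(line.split()) for line in chunk)

def run_job(lines, workers=4):
    # split the job into roughly equal parts and run them in parallel;
    # each worker only ever sees a fraction of the full input
    size = max(1, len(lines) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_words, chunks)
    return sum(partials)
```

In a real cluster the chunks would live on different machines; here the thread pool merely stands in for the set of worker nodes.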

The open source project Hadoop was initiated in 2006 by the developer Doug Cutting and can be traced back to Google's MapReduce algorithm. In 2004, the search engine provider published information on a new technology that made it possible to parallelise complex computing processes over large data quantities with the help of a computer cluster. Cutting, who had spent time at Excite, Apple and Xerox PARC and had already made a name for himself with Apache Lucene, soon recognised the potential of MapReduce for his own project and received support from his employer at the time, Yahoo. In 2008, Hadoop became a top-level project of the Apache Software Foundation, and the framework reached release status 1.0.0 in 2011.

In addition to Apache's official release, there are also various forks of the software framework available as business-oriented distributions from different software providers. One form of support for Hadoop is offered by Doug Cutting's current employer, Cloudera, which provides an 'enterprise-ready' open source distribution with CDH. Hortonworks and Teradata feature similar products, and Microsoft and IBM have integrated Hadoop into their respective products, the cloud service Azure and InfoSphere BigInsights.

Hadoop architecture: set-up and basic components

Generally, when one refers to Hadoop, this means an extensive software package, also sometimes called the Hadoop ecosystem. It contains the system's core components (Core Hadoop) as well as various extensions (many carrying colourful names like Pig, Chukwa, Oozie or ZooKeeper) that add functions for processing large amounts of data to the framework. These closely related projects also hail from the Apache Software Foundation.

Core Hadoop constitutes the basis of the Hadoop ecosystem. In version 1, the integral components of the software core are the base module Hadoop Common, the Hadoop Distributed File System (HDFS) and a MapReduce engine, which was replaced by the cluster management system YARN (also referred to as MapReduce 2.0) in version 2. This set-up removes the MapReduce algorithm from the actual management system, giving it the status of a YARN-based plugin.

Hadoop Common

The Hadoop Common module provides all other components of the framework with a set of basic functions. Among these are:

  • Java archive files (JAR) that are needed to start Hadoop,
  • Libraries for serialising data,
  • Interfaces for accessing the file system of the Hadoop architecture and for the remote procedure call (RPC) communication within the computer cluster.

Hadoop Distributed File System (HDFS)

HDFS is a highly available file system used to save large quantities of data in a computer cluster and is therefore responsible for storage within the framework. To this end, files are separated into blocks of data and distributed redundantly across different nodes without any predefined organisational scheme. According to the developers, HDFS is able to manage files numbering in the millions.

The Hadoop cluster generally functions according to the master/slave model: the architecture is composed of a master node to which numerous subordinate slave nodes are assigned. This principle is mirrored in the structure of HDFS, which is based on a NameNode and various subordinate DataNodes. The NameNode manages all metadata of the file system, including the directory structures and files; the actual data storage takes place on the subordinate DataNodes. To minimise data loss, files are separated into individual blocks and saved multiple times on different nodes. In the standard configuration, each data block is saved in triplicate.
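
The block splitting and triple replication described above can be mimicked in a short Python sketch (the round-robin placement and node names are illustrative assumptions; real HDFS placement also takes racks and free disk space into account):

```python
def split_into_blocks(data: bytes, block_size: int):
    # cut a file into fixed-size blocks, as HDFS does on write
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=3):
    # NameNode-style metadata: block index -> DataNodes holding a copy;
    # simple round-robin stands in for HDFS's rack-aware placement policy
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [datanodes[(idx + r) % len(datanodes)]
                          for r in range(replication)]
    return placement
```

Losing any single node still leaves two copies of every block, which is the whole point of the default replication factor of three.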

Every DataNode sends the NameNode a sign of life, known as a 'heartbeat', at regular intervals. Should this signal fail to materialise, the NameNode declares the corresponding slave 'dead' and, with the help of the replicas, ensures that enough copies of the affected data blocks remain available in the cluster. The NameNode occupies a central role within the framework. To keep it from becoming a single point of failure, it is common practice to provide this master node with a SecondaryNameNode. This records any changes made to the metadata, making it possible to restore the central controlling instance of HDFS.
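
The heartbeat bookkeeping amounts to remembering when each node last reported in. A minimal Python sketch (class, method names and the timeout value are invented for illustration; real HDFS sends heartbeats every few seconds and only declares a node dead after several minutes):

```python
import time

class NameNodeMonitor:
    """Tracks the last heartbeat per DataNode and flags silent ones as dead."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        # record the 'sign of life' sent by a DataNode
        self.last_seen[node] = time.monotonic() if now is None else now

    def dead_nodes(self, now=None):
        # any node silent for longer than the timeout is considered dead
        now = time.monotonic() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]
```

On a dead-node verdict, the real NameNode would then schedule re-replication of that node's blocks.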

During the transition from Hadoop 1 to Hadoop 2, HDFS gained further safeguards: NameNode HA (high availability) adds a failover mechanism that automatically starts a backup component whenever the NameNode crashes. A snapshot function also enables the system to be rolled back to a prior state, and the Federation extension makes it possible to operate multiple NameNodes within one cluster.

MapReduce Engine

Originally developed by Google, the MapReduce algorithm, implemented as a standalone engine in Hadoop version 1, is a further main component of Core Hadoop. This engine is primarily responsible for managing resources as well as controlling and monitoring computing processes (job scheduling/monitoring). Data processing relies largely on the 'map' and 'reduce' phases, which make it possible to process data directly where it is stored (data locality), decreasing computing time and network load. In the map phase, complex computing processes (jobs) are separated into individual parts and then distributed by a so-called JobTracker (located on the master node) to numerous slave systems in the cluster, where TaskTrackers ensure that the subtasks are processed in parallel. In the subsequent reduce phase, the interim results are collected by the MapReduce engine and compiled into one overall result.
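
The two phases can be illustrated with the classic word-count example in plain Python (a simulation of the data flow, not Hadoop's Java API): map emits key-value pairs per input split, a shuffle step groups values by key, and reduce folds each group into a final value.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # 'map': emit a (key, value) pair for every word in one input split
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    # group all values by key, as the framework does between the two phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # 'reduce': fold each key's list of values into a single result
    return {key: sum(values) for key, values in groups.items()}

def word_count(splits):
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return reduce_phase(shuffle(mapped))
```

In Hadoop, each `map_phase` call would run on the node holding that split, and the shuffle would move data across the network between mappers and reducers.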

While master nodes generally run the NameNode and JobTracker components, a DataNode and a TaskTracker work on each subordinate slave. The following graphic displays the basic structure of a Hadoop architecture (according to version 1), split into a MapReduce layer and an HDFS layer.

With the release of Hadoop version 2, the MapReduce engine was fundamentally overhauled. The result is the cluster management system YARN/MapReduce 2.0, which decouples resource management and task management (job scheduling/monitoring) from MapReduce and thus opens the framework to new processing models and a wide range of Big Data applications.

YARN/MapReduce 2.0

With the introduction of the YARN module ('Yet Another Resource Negotiator'), Hadoop's architecture went through a fundamental change that marks the transition from Hadoop 1 to Hadoop 2. While Hadoop 1 only offers MapReduce as an application, YARN decouples resource and task management from the data processing model, which allows a wide variety of Big Data applications to be integrated into the framework. Consequently, MapReduce under Hadoop 2 is only one of many possible applications for accessing data that can be executed in the framework. The framework is thus more than a simple MapReduce runtime environment; YARN assumes the role of a distributed operating system for the resource management of Big Data applications.

The fundamental changes to the Hadoop architecture primarily affect the two trackers of the MapReduce engine, which no longer exist as standalone components in Hadoop 2. Instead, the YARN module relies on three new entities: the ResourceManager, the NodeManager, and the ApplicationMaster.

  • ResourceManager: the global ResourceManager acts as the highest authority in the Hadoop architecture (master) and is assigned various NodeManagers as slaves. Its responsibilities include controlling the computer cluster, orchestrating the distribution of resources among the subordinate NodeManagers, and distributing applications. The ResourceManager knows where the individual slave systems within the cluster are located and which resources they can provide. A particularly important component of the ResourceManager is the ResourceScheduler, which decides how the available resources in the cluster are distributed.
  • NodeManager: a NodeManager runs on each node of the computer cluster. It assumes the slave role in the Hadoop 2 infrastructure and thus acts as a command recipient of the ResourceManager. When a NodeManager is started on a node in the cluster, it registers with the ResourceManager and sends a sign of life, the heartbeat, at regular intervals. Each NodeManager is responsible for the resources of its own node and provides a portion of them to the cluster. How these resources are used is decided by the ResourceManager's ResourceScheduler.
  • ApplicationMaster: each application running in the YARN system has an ApplicationMaster, which requests resources from the ResourceManager and receives them in the form of containers. The ApplicationMaster executes and monitors the Big Data application within these containers.

Here is a schematic depiction of the Hadoop 2 architecture:

When a Big Data application is to be executed on Hadoop 2, three actors are generally involved:

  • a client,
  • a ResourceManager, and
  • one or more NodeManagers.

First, the client instructs the ResourceManager to start a job for a Big Data application in the cluster. The ResourceManager then allocates a container; in other words, it reserves the cluster's resources for the application and contacts a NodeManager. The contacted NodeManager starts the container and executes an ApplicationMaster within it, which is responsible for executing and monitoring the application.
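
The interplay of these actors can be sketched in Python (class and method names are simplifications invented for illustration; real YARN schedulers also weigh queues, locality and fairness):

```python
class NodeManager:
    """A slave node offering part of its resources to the cluster."""

    def __init__(self, name, memory_mb):
        self.name, self.free_mb = name, memory_mb
        self.containers = []

    def start_container(self, app, mem_mb):
        # reserve local memory and launch the application in a container
        self.free_mb -= mem_mb
        self.containers.append(app)

class ResourceManager:
    """The master: decides which node gets to run a requested container."""

    def __init__(self, node_managers):
        self.nodes = node_managers

    def submit(self, app, mem_mb):
        # pick the first node with enough free memory (toy scheduling policy)
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.start_container(app, mem_mb)
                return node.name
        raise RuntimeError("no node has enough free resources")
```

A client call such as `rm.submit("appmaster-1", 2048)` corresponds to the first step above: the ResourceManager reserves a container and hands the launch over to a NodeManager.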

The Hadoop ecosystem: optional expansion components

In addition to the core components, the Hadoop ecosystem encompasses a wide range of extensions that are run as separate software projects and make substantial contributions to the functionality and flexibility of the framework. Thanks to the open source code and numerous interfaces, these optional add-on components can be freely combined with the core functions. The following is a list of the most popular projects in the Hadoop ecosystem:

  • Ambari: the Apache project Ambari was initiated by the Hadoop distributor Hortonworks and adds installation and management tools to the ecosystem that facilitate provisioning IT resources and managing and monitoring Hadoop components. To this end, Apache Ambari offers a step-by-step wizard for installing Hadoop services on any number of computers within the cluster, as well as a management function with which services can be centrally started, stopped, or configured. A graphical user interface informs users about the status of the system, and the Ambari Metrics System and the Ambari Alert Framework enable metrics to be recorded and alarm levels to be configured.
  • Avro: Apache Avro is a system for serialising data. Avro uses JSON to define data types and protocols, while the actual data is serialised in a compact binary format. It serves as a data transfer format for the communication between the different Hadoop nodes as well as between Hadoop services and client programs.
  • Cassandra: written in Java, Apache Cassandra is a distributed database management system for large amounts of structured data that follows a non-relational approach. This kind of software is also referred to as a NoSQL database. Originally developed by Facebook, the open source system aims to achieve high scalability and reliability in large, distributed architectures. Data is stored on the basis of key-value relations.
  • HBase: HBase is likewise an open source NoSQL database that enables real-time read and write access to large amounts of data within a computer cluster. HBase is based on Google's high-performance database system BigTable. In comparison to other NoSQL databases, HBase is characterised by high data consistency.
  • Chukwa: Chukwa is a data acquisition and analysis system that builds on HDFS and MapReduce and allows real-time monitoring as well as data analyses in large, distributed systems. To achieve this, Chukwa uses agents that run on every observed node and collect the log files of the applications running there. These files are sent to so-called collectors and then saved in HDFS.
  • Flume: Apache Flume is another service designed for collecting, aggregating, and moving log data. To stream data from different sources to HDFS for storage and analysis, Flume implements transport formats like Apache Thrift or Avro.
  • Pig: Apache Pig is a platform for analysing large amounts of data that provides Hadoop users with the high-level programming language Pig Latin. Pig makes it possible to describe the flow of MapReduce jobs on an abstract level: MapReduce requests are no longer written in Java but programmed in the far more concise Pig Latin, which makes managing MapReduce jobs much simpler and also supports the parallel execution of complex analyses. Pig Latin was originally developed by Yahoo. The name reflects the software's approach: like an 'omnivore', Pig is designed to process all types of data (structured, unstructured, or relational).
  • Hive: Apache Hive adds a data warehouse to Hadoop, i.e. a centralised database for analytical workloads. The software was developed by Facebook and is based on the MapReduce framework. With HiveQL, Hive offers an SQL-like syntax that makes it possible to query, aggregate, or analyse data saved in HDFS. To this end, Hive automatically translates SQL-like requests into MapReduce jobs.
  • HCatalog: a core component of Apache Hive is HCatalog, a metadata and table management system that makes it possible to store and process data independently of its format or structure. HCatalog describes the structure of the data and thus makes its use through Hive or Pig easier.
  • Mahout: Apache Mahout adds easily extensible Java libraries for data mining and mathematical machine learning applications to the Hadoop ecosystem. Algorithms that can be implemented with Mahout in Hadoop enable operations such as classification, clustering, and collaborative filtering. Mahout can be used, for instance, to develop recommendation services ('customers who bought this item also bought…').
  • Oozie: the optional workflow component Oozie makes it possible to create process chains, automate them, and have them executed on a schedule. Oozie thereby compensates for deficits in the MapReduce engine of Hadoop 1.
  • Sqoop: Apache Sqoop is a software component that facilitates the import and export of large data quantities between the Hadoop Big Data framework and structured data stores. Since company data is nowadays generally held in relational databases, Sqoop makes it possible to exchange data efficiently between these storage systems and the computer cluster.
  • ZooKeeper: Apache ZooKeeper offers services for coordinating processes in the cluster by providing functions for storing, distributing, and updating configuration information.
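
The item-to-item collaborative filtering mentioned for Mahout boils down to co-occurrence counting, which a small Python sketch can illustrate (a toy version of the idea, not Mahout's implementation; the basket data is invented):

```python
from collections import Counter
from itertools import permutations

def cooccurrence(baskets):
    # count how often each ordered pair of items appears in the same purchase
    counts = Counter()
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts[(a, b)] += 1
    return counts

def recommend(item, counts, top=3):
    # 'customers who bought `item` also bought', ranked by co-occurrence;
    # ties are broken alphabetically to keep the output deterministic
    scored = [(other, n) for (a, other), n in counts.items() if a == item]
    return [other for other, n in sorted(scored, key=lambda s: (-s[1], s[0]))[:top]]
```

Mahout runs this kind of counting as distributed MapReduce jobs over purchase histories far too large for one machine; the logic itself stays this simple.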

Hadoop for businesses

Given that Hadoop clusters can be set up for processing large data volumes with the help of standard PCs, the Big Data framework has become a popular solution for numerous companies, among them Adobe, AOL, eBay, Facebook, Google, IBM, LinkedIn, Twitter, and Yahoo. In addition to the possibility of easily saving and simultaneously processing data on distributed architectures, Hadoop's stability, expandability, and extensive range of functions are further advantages of this open source software.
