Hadoop ecosystem

ISHMEET KAUR
May 25, 2020
  • Apache Hadoop is an open source framework for storing and processing large scale data, distributed across clusters of nodes.
  • To make Hadoop enterprise ready, numerous Apache software foundation projects are available to integrate and deploy with Hadoop. Each project has its own community of developers and release cycles.
  • The Hadoop ecosystem includes both open source Apache software projects and a wide range of commercial tools and solutions that integrate with Hadoop, to extend its capabilities.
  • Commercial Hadoop offerings include distributions from vendors like Hortonworks, Cloudera and MapR plus a variety of tools for specific Hadoop development and maintenance tasks.

Five functions of the Hadoop ecosystem

1. Data management using HDFS, HBase and YARN.

2. Data access with MapReduce, Hive and Pig.

3. Data ingestion and integration using Flume, Sqoop, Kafka and Storm.

4. Data monitoring using Ambari, Zookeeper and Oozie.

5. Data governance and security using Falcon, Ranger and Knox.

Data Management in Hadoop

Goal: To store and process vast quantities of data in a storage layer that scales linearly.

Hadoop Distributed File System (HDFS):

•Java-based file system

•provides scalable and reliable data storage

•spans large clusters of commodity servers.
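To show how applications interact with HDFS, here is a minimal sketch (mine, not from the original article) that writes and reads back a small file with the Hadoop FileSystem Java API; the NameNode URI and file path are placeholders.

```java
// Minimal HDFS client sketch; the NameNode URI and file path are placeholders.
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally the NameNode address comes from core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/tmp/hello.txt");

        // Write a small file; HDFS replicates its blocks across the cluster.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```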

Apache Hadoop YARN (Yet Another Resource Negotiator):

•Prerequisite for Enterprise Hadoop

•Extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.

•Provides resource management

•Provides a pluggable architecture enabling a wide variety of data access methods to operate on data stored in Hadoop.

Data Access in Hadoop

Goal: To interact with data in a wide variety of ways, from implementing complex business logic to using simple, easy-to-use APIs, while supporting both structured and unstructured data.

MapReduce: Framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines in a reliable and fault-tolerant manner (a minimal WordCount sketch follows this overview).

Apache Hive: Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.

Apache Pig: Provides scripting capabilities. Offers an alternative to MapReduce programming in Java. Pig scripts are automatically converted to MapReduce programs by the Pig engine.
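To make the MapReduce programming model concrete, here is a minimal word-count sketch (mine, not from the article): the mapper emits (word, 1) pairs and the reducer sums them. Input and output paths are placeholders passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are placeholders passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```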

Apache Hive

•Hive evolved as a data warehousing solution built on top of the Hadoop Map-Reduce framework.

•Enables data warehouse tasks like ETL, data summarization, query and analysis.

•Can access files stored in HDFS or in other storage systems such as HBase.

•Hive provides SQL-like declarative language, called HiveQL, to run distributed queries on large volumes of data.

•In Hive, tables and databases are created first and then data is loaded into these tables for managing and querying structured data.

•Hive comes with a command-line shell interface which can be used to create tables and execute queries.

  • Built originally by Facebook and open sourced to the Apache Software Foundation.
  • Hive engine compiles HiveQL queries into Map-Reduce jobs to be executed on Hadoop. In addition, custom Map-Reduce scripts can also be plugged into queries.
  • Components of Hive include HCatalog and WebHCat.

HCatalog: is a component of Hive. It is a table and storage management layer for Hadoop that enables users with different data processing tools, including Pig and MapReduce, to more easily read and write data on the grid.

WebHCat: provides a service that you can use to run Hadoop MapReduce (or YARN), Pig, Hive jobs or perform Hive metadata operations using an HTTP (REST style) interface.
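As a small illustration of HiveQL, here is a hedged sketch that runs a query through the Hive JDBC driver against HiveServer2; the server address, credentials and the web_logs table are placeholders, not from the article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder HiveServer2 address, database, user and table name.
        String url = "jdbc:hive2://hiveserver2-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL: declarative and SQL-like; compiled into distributed jobs by the Hive engine.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```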

Apache Pig

•High level programming language useful for analyzing large data sets. Enables data analysts to write data analysis programs without the complexities of MapReduce.

•Developed at Yahoo and open sourced to Apache.

•Pig runs on Apache Hadoop YARN and makes use of MapReduce and the Hadoop Distributed File System (HDFS).

•Pig consists of 2 components:

1. Pig Latin: the language used to write scripts.

2. Runtime environment: includes a parser, optimizer, compiler and execution engine. Converts Pig scripts into a series of MapReduce jobs and executes them on the Hadoop MapReduce engine.
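For illustration, the sketch below (mine, not from the article) embeds a few Pig Latin statements via the PigServer Java API; the input path, schema and output path are placeholders, and the STORE step is what triggers compilation into MapReduce jobs.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin from Java; ExecType.MAPREDUCE sends the compiled jobs to the cluster
        // (ExecType.LOCAL runs them in-process for testing).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Pig Latin statements; '/data/web_logs.tsv' and its schema are placeholders.
        pig.registerQuery("logs = LOAD '/data/web_logs.tsv' AS (user:chararray, page:chararray);");
        pig.registerQuery("by_page = GROUP logs BY page;");
        pig.registerQuery("hits = FOREACH by_page GENERATE group AS page, COUNT(logs) AS n;");

        // STORE triggers compilation into MapReduce jobs and execution on the cluster.
        pig.store("hits", "/output/page_hits");
        pig.shutdown();
    }
}
```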

Ingestion and integration using Flume, Sqoop, Kafka and Storm

Goal: To quickly and easily load and process data from a variety of sources.

Apache Sqoop: Sqoop is an effective tool for transferring data from RDBMS and NoSQL data stores into Hadoop, and for exporting data from HDFS back to relational and other external databases.

  • Sqoop Import: The import tool imports individual tables from RDBMS to HDFS. Each row in a table is treated as a record in HDFS. All records are stored as text data in text files or as binary data in Avro and Sequence files.
  • Sqoop Export: The export tool exports a set of files from HDFS back to an RDBMS. The files given as input to Sqoop contain records, which become rows in the target table. They are read and parsed into a set of records using a user-specified delimiter.
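As a rough illustration of an import, the sketch below simply shells out to the sqoop command line from Java; it assumes Sqoop is installed and on the PATH, and the JDBC URL, credentials, table and target directory are all placeholders.

```java
import java.util.Arrays;
import java.util.List;

public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        // Equivalent to running `sqoop import ...` on the shell; all connection
        // details and the table name are placeholders.
        List<String> cmd = Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:mysql://db-host/sales",
            "--username", "etl_user", "-P",          // -P prompts for the password
            "--table", "orders",                     // each row becomes a record in HDFS
            "--target-dir", "/warehouse/orders",
            "--as-avrodatafile"                      // store records as Avro instead of text
        );
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```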

Apache Sqoop Connector Architecture: Various third-party connectors for data stores, ranging from enterprise data warehouses (including Netezza, Teradata, and Oracle) to NoSQL stores (such as Couchbase), can be downloaded and installed with Sqoop.

  • Data transfer between Sqoop and external storage system is made possible with the help of Sqoop’s connectors.
  • Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
  • A generic JDBC connector is available for connecting to any database that supports Java’s JDBC protocol.
  • Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.

Apache Flume in the Hadoop ecosystem: Flume allows you to efficiently move large amounts of log data from many different sources to Hadoop.

Flume provides a data ingestion mechanism for collecting, aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.

Flume supports a large set of sources and destinations:

1. 'tail': pipes data from a local file and writes it into HDFS via Flume, similar to the Unix 'tail' command.

2. System logs: Apache log4j (enables Java applications to write events to files in HDFS via Flume).

3. Social networking sites like Twitter, Facebook and Amazon.

4. Destinations include HDFS and HBase.

  • The transactions in Flume are channel-based: two transactions (one for the sender and one for the receiver) are maintained for each message, which guarantees reliable message delivery.
  • When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.

Apache Flume architecture: Flume has a flexible design based on streaming data flows, and its own query processing engine makes it easy to transform each new batch of data before it is moved to the intended sink.

  • Events from an external source are consumed by a Flume source. The external source sends events to the Flume source in a format that the target source recognizes.
  • The Flume source receives an event and stores it into one or more channels. The channel acts as a store that keeps the event until it is consumed by the Flume sink. The channel may use the local file system to store these events.
  • The Flume sink removes the event from the channel and stores it in an external repository such as HDFS. There can be multiple Flume agents, in which case the sink forwards the event to the Flume source of the next agent in the flow.
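To show the client side of this flow, here is a hedged sketch using Flume's RPC client API to send events to an agent whose Avro source is assumed to listen on flume-host:41414 (both values are placeholders); the agent itself would be configured separately with a source, channel and sink.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Connects to a Flume agent whose Avro source listens on flume-host:41414
        // (both values are placeholders from an assumed agent configuration).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            for (int i = 0; i < 10; i++) {
                Event event = EventBuilder.withBody("log line " + i, StandardCharsets.UTF_8);
                // The source places the event on a channel; the sink later drains it to HDFS/HBase.
                client.append(event);
            }
        } finally {
            client.close();
        }
    }
}
```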

Apache Kafka: Publish-subscribe messaging system for real time event processing that offers strong durability, scalability and fault tolerance support.

Kafka is a distributed streaming platform:

1. It lets you publish and subscribe to streams of records.

2. It lets you store streams of records in a fault-tolerant way.

3. It lets you process streams of records as they occur.

Kafka has four core APIs:

1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.

2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.

3. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.

4. The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
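As a minimal illustration of the Producer API, the sketch below publishes a few records to an assumed "clickstream" topic; the broker address and topic name are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address and topic name are placeholders.
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish a small stream of records to the "clickstream" topic.
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("clickstream", "user-" + i, "page_view"));
            }
        }
    }
}
```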

Apache Storm: Real-time computation system for processing fast, large streams of data.

Features:

•Not a queue.

•Consumes streams of data and processes them in arbitrarily complex ways.

•Can integrate with any queueing system and any database system.

Example use case:

Kafka provides a distributed and robust queue in a publish-subscribe mode to pass high-volume, real-time data from one end-point to another. Storm extracts data from Kafka messages and allows manipulating the data in many ways in real time.

Three abstractions in Apache Storm:

1. A spout is a source of streams in a computation. A spout typically reads from a queueing broker such as Kafka, but it can also read from a source like the Twitter streaming API.

2. A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.

3. A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
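The sketch below (mine, not from the article) wires these three abstractions together: a toy spout emitting sentences, a bolt that splits them into words, and a topology run in a LocalCluster. In production the spout would typically be a Kafka spout and the topology would be submitted with StormSubmitter.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class SentenceTopology {

    // Spout: source of the stream (here random sentences; in practice often a Kafka spout).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] sentences = {"the quick brown fox", "jumps over the lazy dog"};

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(sentences[(int) (Math.random() * sentences.length)]));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: processes input tuples and emits new ones (here, splits sentences into words).
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        // Topology: a network of spouts and bolts; the edge below subscribes SplitBolt
        // to SentenceSpout's output stream with a shuffle grouping.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");

        // Runs the topology in-process; on a cluster, StormSubmitter.submitTopology is used instead.
        new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
    }
}
```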

When to use which ingestion tool?

Apache Sqoop: When data is sitting in data stores like RDBMS, data warehouses or NoSQL data stores.

Apache Storm: When fast, large streams of data need to be processed in real time. Storm is a real-time computation system, not a queue.

Apache Flume: When moving bulk streaming data from various sources like web servers and social media.

Pros include:

•Flume is tightly integrated with Hadoop. It integrates with HDFS security very well.

•Flume is supported by a number of enterprise Hadoop providers.

•Supports built-in sources and sinks out of the box.

•Makes event filtering and transforming very easy. For example, you can filter out messages that you are not interested in early in the pipeline, before sending them over the network, for a performance gain.

Cons include: Not as scalable when adding additional consumers. Lower message durability than Kafka.

Apache Kafka: is used for building real-time data pipelines and streaming apps.

Pros include:

•Better scalability and message durability.

•Easy to add a large number of consumers without affecting performance or incurring downtime.

Cons: General purpose and not tightly integrated with Hadoop. May have to write your own producers and consumers.

Data monitoring using Ambari, Oozie and Zookeeper

Goal: To provision, manage, monitor and operate Hadoop clusters at scale.

Apache Ambari: An open source installation life cycle management, administration and monitoring system for Apache Hadoop clusters.

•Provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Ambari enables System Administrators to:

1. Provision a Hadoop cluster: Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts. Ambari handles configuration of Hadoop services for the cluster.

2. Manage a Hadoop cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

3. Monitor a Hadoop cluster:

a. Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.

b. Ambari leverages the Ambari Metrics System for metrics collection.

c. Ambari leverages the Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
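For illustration, here is a hedged sketch that calls Ambari's REST API to list a cluster's services; the host, credentials and cluster name are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariApiExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Ambari server, credentials, and cluster name.
        URL url = new URL("http://ambari-host:8080/api/v1/clusters/mycluster/services");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("X-Requested-By", "ambari");   // required by Ambari for write calls

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // JSON describing each service and its state
            }
        }
    }
}
```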

Apache Oozie: Workflow engine for Apache Hadoop

Oozie is a Java web application used to schedule Apache Hadoop jobs.

It can combine multiple complex jobs to be run in a sequential order to achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed to run in parallel.

Oozie is responsible for triggering the workflow actions, which in turn use the Hadoop execution engine to actually execute the tasks.

It is tightly integrated with the Hadoop stack, supporting various Hadoop jobs such as MapReduce, Hive, Pig and Sqoop, as well as system-specific jobs such as Java and shell actions.

Oozie workflow definitions are written in hPDL (an XML Process Definition Language).
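As a small illustration, the sketch below submits a workflow through the Oozie Java client; the Oozie URL, HDFS application path and cluster endpoints are placeholders, and the workflow.xml (written in hPDL) is assumed to already exist at that path.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL and workflow application path.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        // Points at a directory on HDFS containing workflow.xml written in hPDL.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/daily-load");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        // Submit and start the workflow job, then report its status.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow job: " + jobId);
        System.out.println("Status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```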

Apache Zookeeper: Open source distributed software that provides coordination and operational services between distributed processes on a Hadoop cluster of nodes. Zookeeper provides:

•Naming service: Identify nodes in a cluster by name.

•Configuration management: Synchronizes configuration between nodes, ensuring consistent configuration.

•Process synchronization: Zookeeper coordinates the starting and stopping of multiple nodes in the cluster. This ensures that all processing occurs in the intended order.

•Leader election: Zookeeper can assign a "leader" role to one of the nodes. This leader/master handles all client requests on behalf of the cluster. If the leader node fails, another leader will be elected from the remaining nodes.

•Reliable messaging: Fulfills the need for communication between and among the nodes in the cluster that is specific to the distributed application. Zookeeper offers a publish/subscribe capability that allows the creation of a queue. This queue guarantees message delivery even in the case of a node failure.
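To make the leader-election idea concrete, here is a hedged sketch using the plain ZooKeeper client: each process creates an ephemeral, sequential znode under an assumed /election parent node, and the one holding the lowest sequence number acts as leader; the ensemble address is a placeholder.

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble address; the parent znode /election is assumed to exist.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> { });

        // Each node creates an ephemeral, sequential znode; it disappears automatically
        // if the node's session dies, which is what triggers re-election.
        String me = zk.create("/election/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The node holding the lowest sequence number acts as the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean leader = me.endsWith(children.get(0));
        System.out.println(leader ? "I am the leader" : "I am a follower");
    }
}
```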
