Hadoop/MapReduce Notes


About

Big Data:
- Operational Big Data (MongoDB: capturing and storing data in real time)
- Analytical Big Data (Massively Parallel Processing: MapReduce)

1. Introduction

MapReduce provides a new method of analyzing data.
MapReduce is complementary to SQL.
A system based on MapReduce can be scaled from a single server up to thousands of machines, mixing high-end and low-end hardware.


1.1. MapReduce Algorithm

The MapReduce algorithm (from Google) divides a big task into small parts, assigns those parts to many computers connected over the network, and collects the partial results to form the final result dataset.
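To make the idea concrete, here is a minimal sketch in plain Java (not Hadoop) of the same divide-and-collect pattern: the input is split into chunks, each chunk is handled by a separate worker, and the partial results are combined into the final result. The class and variable names are illustrative only.

// Conceptual sketch only (plain Java, not Hadoop): split a big task into
// small parts, hand each part to a worker, then collect the partial results.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndCollect {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;

        int workers = 4;
        int chunk = data.length / workers;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();

        // "Map" phase: each worker processes its own slice of the input.
        for (int w = 0; w < workers; w++) {
            final int start = w * chunk;
            final int end = (w == workers - 1) ? data.length : start + chunk;
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = start; i < end; i++) sum += data[i];
                return sum;
            }));
        }

        // "Collect/Reduce" phase: combine the partial results into the final result.
        long total = 0;
        for (Future<Long> p : partials) total += p.get();
        pool.shutdown();

        System.out.println("Total = " + total); // 1000000
    }
}

MapReduce applies the same pattern, except that the workers are processes spread over many machines and the framework takes care of distribution and failures.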

1.2. Hadoop

Hadoop is an open source project implementing the MapReduce algorithm.
The data is processed in parallel on different CPU nodes.
The Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.


2. Hadoop

Hadoop is an Apache open source framework written in Java. An application built on the Hadoop framework works in an environment that provides distributed storage and computation across clusters of computers.

2.1 Architecture

Hadoop is made of four core modules:
- Hadoop Common: Java libraries and utilities required by the other modules
- Hadoop YARN: A framework for job scheduling and cluster resource management
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data
- Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
Since 2012, the term Hadoop refers not just to the core modules above but also to additional software such as:
- Apache Pig
- Apache Hive
- Apache HBase
- Apache Spark
- And more..

2.2 How does MapReduce work?


As a reminder, MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job performs two different kinds of tasks:
    - The Map task: 
This is the first task. It takes input data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
    - The Reduce task:
It takes the output of the Map task (key/value pairs) as input and combines those data tuples into a smaller set of tuples (it reduces them). 

Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed ones.
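A classic illustration is word counting. The sketch below uses the standard org.apache.hadoop.mapreduce API; the class names (WordCountMapper, WordCountReducer) are illustrative, not anything defined above. The map task emits a (word, 1) pair for every word in its input split; the framework then groups all values with the same key, and the reduce task sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: break each input line into (word, 1) tuples.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}

// Reduce task: combine all counts for the same word into one total.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total)
    }
}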

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node.
    - The master (JobTracker): Is responsible for resource management, tracking resource consumption/availability, scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
The JobTracker is a single point of failure for the Hadoop MapReduce service: if the JobTracker goes down, all running jobs are halted.
    - The slaves (TaskTrackers): Execute the tasks as directed by the master and provide task-status information to the master periodically.

2.3 Hadoop Distributed File System (HDFS)

Hadoop can work with any file system, but the most commonly used one is HDFS.
HDFS is based on the Google File System (GFS). It is a distributed file system designed to run on large clusters of small computers in a reliable, fault-tolerant manner.
It uses a master/slave architecture:
    - The master (NameNode): Manages the file system metadata
    - The slave(s) (DataNode(s)): Store the actual data.

A file is split into several blocks and those blocks are stored in a set of DataNodes. 
The NameNode determines the mapping of blocks to the DataNodes.
The DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion and replication based on instructions given by the NameNode.
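For illustration, the following sketch uses the Hadoop Java client API (org.apache.hadoop.fs.FileSystem) to write and read a file on HDFS; the path and class name are made up for the example. The client only contacts the NameNode for metadata (such as block locations), while the actual bytes flow to and from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // The Configuration picks up fs.defaultFS (the NameNode address)
        // from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt"); // illustrative path

        // Write: the NameNode records the metadata, the blocks are
        // streamed to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns the block locations, the data itself
        // comes from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}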

2.4 General Mechanism 


Stage 1
A job is submitted to Hadoop (via the job client) by specifying:
    1. The location of the input and output files in the distributed file system
    2. The Java classes (jar file) containing the implementation of the map and reduce functions
    3. The job configuration, i.e. the different parameters specific to the job
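A minimal driver class covering these three points might look as follows. It assumes the WordCountMapper and WordCountReducer sketched in section 2.2; all class names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // 2. The jar/classes containing the map and reduce implementations
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 3. Job configuration: output key/value types, etc.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 1. Input and output locations in the distributed file system
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and wait for it to finish.
        // Typically packaged into a jar and run with:
        //   hadoop jar wordcount.jar WordCountDriver <input dir> <output dir>
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitting the jar hands the job over to the cluster, which is where Stage 2 below picks up.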

Stage 2
The job client submits the job (jar/executables, etc.) and its configuration to the JobTracker, which assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

Stage 3
The TaskTrackers on the different nodes execute the tasks as per the MapReduce implementation.
The output of the reduce function is stored in output files on the file system.

Sources
http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm


