hadoop

the core component of Hadoop are yarn, hdfs and MapReduce

hdfs

hdfs is a file system that manage files that located in multi-webstations. It offers possibilities of stream data access and large data set processing. the advantages are high fault tolerancy, so that single node drops down does that influence the whole system performance

components

NameNode

NameNode is the manager node that responsible for namespace of file-system, Data block mapping information, cluster configuration and copy of memory blocks. NameNode saves meta-data in memory, which contains file information, corresponding file block information and information of file blocks on DataNode files on NameNode:

fsimage Metadata mirror file(temporary metadata information)
edits operation logs
fstime save last checkpoint time
secondary NameNode

assisting NameNode; merge fsimage and fsedits file; backup nameNode

DataNode

main purpose is to store data files using blocks(defaul size is 128MB) with replication

client

purpose: file slicing, communicate with DataNode, access and manage hdfs client will cache data locally until data reaches one block size, one data block will be assigned on DataNode, then data will be wrote into this block

disadvantages

bad useer cases:
large quantities of small files (too many map task->much cost for process management)
high real-time data processing (high latency->Hbase is a better choice)
random read and random file changes not possible (write is only possible at the end of file, hdfs does not support write-operation by multi-user and changes in random position)
shell operation
bin/hdfs dfs -ls -R /
recursively list file in hdfs
bin/hdfs dfs -rm -R / recursively delete file in folder
bin/hdfs dfs -mkdir / create folder
bin/hdfs dfs -put NOTICE.txt / upload file
bin/hdfs dfs -get /folder /folder download
bin/hdfs dfs -text /NOTICE.txt |more open file contents
bin/hdfs dfs -appendToFile Notice.txt /Notice.txt append to file

JavaAPI

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HDFSTest {
    public static void main(String[] args)throws Exception{
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://192.168.1.91:9000"),conf);
//		fileSystem.mkdirs(null);
//		fileSystem.delete(null);
//		fileSystem.open(null);
//		fileSystem.create(null);
//		fileSystem.listStatus(files);
        System.out.println(fileSystem);
    }
} ## yarn yarn is a resource manager

container

container is an abstraction of calculation resource(CPU, memory…), it is started, managed and watched by NodeManager and assigned by ResourceManager

ResourceManager

there is only one resourceManager, has two main components: Scheduler and ApplicationManager.

Scheduler distribute resource according to clients’ need;
ApplicationManager manage the requests from clients.Every request from client will create an ApplicationMaster, this ApplicationMaster applies an container from ResourceManager, after getting the resource the ApplicationMaster starts on the applied container( so-called distributed computing). ApplicationMaster splits the job into small tasks and distribute the tasks to other containers. When job is done ApplicationMaster will inform ResourceManager to free the container resources

NodeManager

NodeManager is the proxy of ResourceManager on each node (can be one or more containers), managing resource usages and reporting to ResourceManager

ApplicationMaster

ApplicationMaster goes through lifecycle of one application(job), works together with NodeManager to complete the application

MapReduce

Map is to get related data from source data; Reduce is to combine these mapped related datas principle of MapReduce is to utilize input key/value set to generate an intermediate key/value set using map function, MapReduce gather all intermediate value with intermediate key together and transfer this to Reduce function, Reduce combines the values to generate a smaller output key/value set the Map and Reduce functions should be designed by user

implementation example

input of Map func: (0 ,0067011990999991950051507+0000+) 　　(1 ,0043011990999991950051512+0022+) 　　(2 ,0043011990999991950051518-0011+) 　　(3 ,0043012650999991949032412+0111+) 　　(4 ,0043012650999991949032418+0078+) 　　(5 ,0067011990999991937051507+0001+) 　　(6 ,0043011990999991937051512-0002+) 　　(7 ,0043011990999991945051518+0001+) 　　(8 ,0043012650999991945032412+0002+) 　　(9 ,0043012650999991945032418+0078+)

output of Map func and input of Reduce func: (1950, 0) 　　(1950, 22) 　　(1950, -11) 　　(1949, 111) 　　(1949, 78) 　　(1937, 1) 　　(1937, -2) 　　(1945, 1) 　　(1945, 2) 　　(1945, 78) output of Reduce func: 　　(1950, [0, 22, –11]) 　　(1949, [111, 78]) 　　(1937, [1, -2]) 　　(1945, [1, 2, 78])

Gitalking ...

hdfs

components

NameNode

secondary NameNode

DataNode

client

disadvantages

shell operation