The core components of Hadoop are YARN, HDFS, and MapReduce.
## hdfs
HDFS is a distributed file system that manages files stored across multiple machines. It supports streaming data access and the processing of large data sets. Its main advantage is high fault tolerance: the failure of a single node does not affect the system as a whole.
components
NameNode
The NameNode is the master node. It is responsible for the file system namespace, the block mapping information, the cluster configuration, and the replication of data blocks. The NameNode keeps the metadata in memory: the file information, the blocks that belong to each file, and the DataNodes that hold each block.
Files on the NameNode:
- fsimage: metadata image file (a checkpoint of the metadata)
- edits: operation log
- fstime: stores the time of the last checkpoint
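Secondary NameNode
Assists the NameNode by merging the fsimage and edits files into a new checkpoint. It also keeps a copy of the merged metadata, which can help to recover the NameNode, but it is not a hot standby.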
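The merge of the image and the edit log can be pictured with a small, self-contained sketch. This is not Hadoop code: the `Edit` class, the operation names `CREATE` and `DELETE`, and the path-to-size map are hypothetical simplifications, chosen only to show how replaying an edit log on top of an image yields a new image.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a checkpoint: replay an edit log on top of a namespace image.
// All names are hypothetical; the real fsimage and edits files are binary formats.
final class CheckpointSketch {

    // One logged namespace operation: CREATE adds a path, DELETE removes it.
    static final class Edit {
        final String op;
        final String path;
        final long size;

        Edit(String op, String path, long size) {
            this.op = op;
            this.path = path;
            this.size = size;
        }
    }

    // Returns a new image (path -> size) without modifying the old one.
    static Map<String, Long> merge(Map<String, Long> fsimage, List<Edit> edits) {
        Map<String, Long> merged = new TreeMap<>(fsimage);
        for (Edit e : edits) {
            if (e.op.equals("CREATE")) {
                merged.put(e.path, e.size);
            } else if (e.op.equals("DELETE")) {
                merged.remove(e.path);
            } else {
                throw new IllegalArgumentException("unknown op: " + e.op);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> fsimage = new TreeMap<>();
        fsimage.put("/a.txt", 10L);
        fsimage.put("/b.txt", 20L);

        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit("CREATE", "/c.txt", 30L));
        edits.add(new Edit("DELETE", "/a.txt", 0L));

        // Once the merged image is saved, the old edit log is no longer needed.
        System.out.println(merge(fsimage, edits)); // prints {/b.txt=20, /c.txt=30}
    }
}
```

Keeping the old image untouched mirrors the idea that a checkpoint produces a new file rather than rewriting the current one in place.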
DataNode
Its main purpose is to store file data as blocks (the default block size is 128 MB), with each block replicated on several DataNodes.
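To make the block arithmetic concrete, here is a back-of-the-envelope sketch. The class `BlockMath`, the 300 MB file, and the replication factor of 3 are assumptions for illustration; only the 128 MB block size comes from the text.

```java
// Back-of-the-envelope block arithmetic; not part of the Hadoop API.
final class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Raw bytes stored cluster-wide: every byte is kept `replication` times.
    static long rawBytes(long fileSize, int replication) {
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long fileSize = 300 * MB;    // hypothetical file
        long blockSize = 128 * MB;   // default block size from the text
        int replication = 3;         // assumed replication factor
        System.out.println(blockCount(fileSize, blockSize));      // 3 (128 + 128 + 44 MB)
        System.out.println(rawBytes(fileSize, replication) / MB); // 900
    }
}
```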
client
Purpose: split files into blocks, communicate with the DataNodes, and access and manage HDFS. When writing, the client caches data locally until it amounts to one block; a block is then allocated on a DataNode and the data is written into that block.
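The buffering behaviour described for the client can be sketched in plain Java. `BufferingWriter` is a hypothetical class, not the HDFS client, and the in-memory list merely stands in for blocks stored on DataNodes; the tiny block size of 4 bytes is chosen so the effect is visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy client-side writer: buffers bytes and emits one block whenever the
// buffer reaches the block size. Hypothetical class, not the HDFS client.
final class BufferingWriter {
    private final int blockSize;
    private final byte[] buffer;
    private int filled = 0;
    private final List<byte[]> blocks = new ArrayList<>(); // stands in for DataNode storage

    BufferingWriter(int blockSize) {
        this.blockSize = blockSize;
        this.buffer = new byte[blockSize];
    }

    void write(byte[] data) {
        for (byte b : data) {
            buffer[filled++] = b;
            if (filled == blockSize) {
                flush(); // a full block is handed over and the buffer starts again
            }
        }
    }

    // Writes whatever is still buffered as a (possibly short) final block.
    void close() {
        if (filled > 0) {
            flush();
        }
    }

    private void flush() {
        byte[] block = new byte[filled];
        System.arraycopy(buffer, 0, block, 0, filled);
        blocks.add(block);
        filled = 0;
    }

    List<byte[]> blocks() {
        return blocks;
    }

    public static void main(String[] args) {
        BufferingWriter w = new BufferingWriter(4); // tiny block size for illustration
        w.write("hello hdfs".getBytes(StandardCharsets.US_ASCII));
        w.close();
        for (byte[] block : w.blocks()) {
            System.out.println(block.length); // prints 4, 4, 2 on separate lines
        }
    }
}
```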
disadvantages
Poor use cases:
- Large numbers of small files (too many map tasks, which means high process-management overhead)
- Low-latency, real-time data processing (HDFS has high latency; HBase is a better choice)
- Random access (random reads are inefficient and in-place changes are not possible: data can only be appended at the end of a file, and HDFS supports neither multiple concurrent writers nor changes at arbitrary positions)
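The small-files problem can be made tangible with a rough count of map tasks. The sketch assumes one input split (and therefore one map task) per block and that a split never spans two files; this is a simplification, not the exact rule Hadoop applies. The file counts and sizes are made up for illustration.

```java
import java.util.Arrays;

// Rough estimate of map task counts; assumes one split per block and that
// a split never spans two files (a simplification, not the exact Hadoop rule).
final class SplitEstimate {

    static long splitsForFile(long fileSize, long blockSize) {
        return fileSize == 0 ? 0 : (fileSize + blockSize - 1) / blockSize;
    }

    static long totalSplits(long[] fileSizes, long blockSize) {
        long total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;

        long[] manySmall = new long[10_000]; // 10,000 files of 1 MB each
        Arrays.fill(manySmall, mb);
        long[] oneLarge = { 10_000 * mb };   // the same amount of data in one file

        System.out.println(totalSplits(manySmall, blockSize)); // 10000
        System.out.println(totalSplits(oneLarge, blockSize));  // 79
    }
}
```

Under these assumptions the same data volume needs over a hundred times more tasks when it is stored as small files.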
shell operation
- `bin/hdfs dfs -ls -R /` recursively list the files in HDFS
- `bin/hdfs dfs -rm -R /` recursively delete the files in a folder
- `bin/hdfs dfs -mkdir /` create a folder
- `bin/hdfs dfs -put NOTICE.txt /` upload a file
- `bin/hdfs dfs -get /folder /folder` download a folder
- `bin/hdfs dfs -text /NOTICE.txt |more` show the contents of a file
- `bin/hdfs dfs -appendToFile Notice.txt /Notice.txt` append to a file
Java API
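import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HDFSTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode; adjust the address to your own cluster.
        FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://192.168.1.91:9000"), conf);
        // Common operations; each takes an org.apache.hadoop.fs.Path:
        // fileSystem.mkdirs(path);        // create a directory
        // fileSystem.delete(path, true);  // delete, recursively if true
        // fileSystem.open(path);          // open a file for reading
        // fileSystem.create(path);        // create a file for writing
        // fileSystem.listStatus(path);    // list the entries of a directory
        System.out.println(fileSystem);
        fileSystem.close();
    }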
}

## yarn
YARN is Hadoop's resource manager.
container
A container is an abstraction of computing resources (CPU, memory, …). It is started, managed, and monitored by the NodeManager and allocated by the ResourceManager.
ResourceManager
There is only one ResourceManager. It has two main components: the Scheduler and the ApplicationsManager.
- The Scheduler allocates resources according to the needs of the applications.
- The ApplicationsManager handles the requests from clients. Every job submitted by a client gets its own ApplicationMaster: a container is requested from the ResourceManager, and the ApplicationMaster is started in that container. The ApplicationMaster then splits the job into small tasks and distributes the tasks to other containers (this is the distributed computation). When the job is done, the ApplicationMaster informs the ResourceManager so that the container resources can be freed.
NodeManager
The NodeManager is the agent of the ResourceManager on each node. It hosts one or more containers, monitors their resource usage, and reports it to the ResourceManager.
ApplicationMaster
An ApplicationMaster accompanies one application (job) through its whole lifecycle and works together with the NodeManagers to complete the application.
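The interplay of these parts can be sketched as a toy simulation. Nothing here is the YARN API: the classes, the first-fit allocation strategy, the 512 MB container for the ApplicationMaster, and the node sizes are all assumptions, and memory is the only resource tracked.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the YARN flow; all names are hypothetical and only memory is tracked.
final class YarnSketch {

    // A container: a slice of one node's memory.
    static final class Container {
        final String node;
        final int memoryMb;

        Container(String node, int memoryMb) {
            this.node = node;
            this.memoryMb = memoryMb;
        }
    }

    // Tracks the free memory per node, as the NodeManagers would report it.
    static final class ResourceManager {
        private final Map<String, Integer> freeMb = new LinkedHashMap<>();

        void registerNode(String node, int memoryMb) {
            freeMb.put(node, memoryMb);
        }

        // Scheduler: the first node with enough free memory wins; null if none fits.
        Container allocate(int memoryMb) {
            for (Map.Entry<String, Integer> e : freeMb.entrySet()) {
                if (e.getValue() >= memoryMb) {
                    e.setValue(e.getValue() - memoryMb);
                    return new Container(e.getKey(), memoryMb);
                }
            }
            return null;
        }

        void release(Container c) {
            freeMb.merge(c.node, c.memoryMb, Integer::sum);
        }

        int free(String node) {
            return freeMb.get(node);
        }
    }

    // One job: a container for the ApplicationMaster, then one container per task;
    // everything is handed back at the end. Returns the number of tasks that got a container.
    static int runJob(ResourceManager rm, int tasks, int taskMb) {
        Container am = rm.allocate(512); // assumed size of the ApplicationMaster container
        if (am == null) {
            return 0;
        }
        List<Container> held = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            Container c = rm.allocate(taskMb);
            if (c == null) {
                break; // cluster is full; this sketch simply stops asking
            }
            held.add(c);
        }
        int ran = held.size();
        for (Container c : held) {
            rm.release(c);
        }
        rm.release(am);
        return ran;
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.registerNode("node1", 2048);
        rm.registerNode("node2", 2048);
        System.out.println(runJob(rm, 5, 1024)); // 3: only three 1024 MB tasks fit next to the AM
        System.out.println(rm.free("node1"));    // 2048: all containers were released
    }
}
```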
## MapReduce
Map extracts the relevant data from the source data; Reduce combines the mapped data. The principle of MapReduce is as follows: the map function takes a set of input key/value pairs and generates a set of intermediate key/value pairs. MapReduce then gathers all intermediate values that share the same intermediate key and passes them to the reduce function. The reduce function combines these values into a smaller set of output key/value pairs. Both the map and the reduce function have to be written by the user.
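The principle can be illustrated with a minimal in-memory sketch. `MiniMapReduce` is a hypothetical helper that has nothing to do with the classes in `org.apache.hadoop.mapreduce`, and the word count is just a stand-in for user-defined map and reduce functions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.BiFunction;

// In-memory sketch of the pipeline: map -> group by key -> reduce.
// Hypothetical helper, unrelated to the Hadoop MapReduce classes.
final class MiniMapReduce {

    // mapper: emits (key, value) pairs for one input record through the collector.
    // reducer: combines all values of one key into a single output value.
    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> inputs,
            BiConsumer<I, BiConsumer<K, V>> mapper,
            BiFunction<K, List<V>, O> reducer) {

        // Map phase plus grouping: collect every intermediate value under its key.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            mapper.accept(input, (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }

        // Reduce phase: one call per distinct key.
        Map<K, O> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            output.put(e.getKey(), reducer.apply(e.getKey(), e.getValue()));
        }
        return output;
    }

    // Example user-defined functions: count the words in a list of lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                (line, emit) -> {
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            emit.accept(word, 1);
                        }
                    }
                },
                (word, ones) -> ones.size());
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("map reduce map");
        lines.add("reduce map");
        System.out.println(wordCount(lines)); // prints {map=3, reduce=2}
    }
}
```

Everything runs on one machine here; the point is only the shape of the data at each step, not the distribution across nodes.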
implementation example
Input of the Map function:
- (0, 0067011990999991950051507+0000+)
- (1, 0043011990999991950051512+0022+)
- (2, 0043011990999991950051518-0011+)
- (3, 0043012650999991949032412+0111+)
- (4, 0043012650999991949032418+0078+)
- (5, 0067011990999991937051507+0001+)
- (6, 0043011990999991937051512-0002+)
- (7, 0043011990999991945051518+0001+)
- (8, 0043012650999991945032412+0002+)
- (9, 0043012650999991945032418+0078+)

Output of the Map function and input of the Reduce function:
- (1950, 0)
- (1950, 22)
- (1950, -11)
- (1949, 111)
- (1949, 78)
- (1937, 1)
- (1937, -2)
- (1945, 1)
- (1945, 2)
- (1945, 78)

Output of the Reduce function:
- (1950, [0, 22, -11])
- (1949, [111, 78])
- (1937, [1, -2])
- (1945, [1, 2, 78])
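The example can be reproduced in memory with a few lines of Java. The field offsets (year in characters 15 to 18, signed value in characters 25 to 29) are inferred from the sample records shown here and are an assumption, not a format specification; `YearTemperature` is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Reproduces the example in memory. The field offsets are inferred from the
// sample records and are an assumption, not a format specification.
final class YearTemperature {

    // Map step: extract (year, value) from one fixed-width record.
    static int[] map(String record) {
        int year = Integer.parseInt(record.substring(15, 19));
        int value = Integer.parseInt(record.substring(25, 30)); // sign included, e.g. "-0011"
        return new int[] { year, value };
    }

    // Grouping step: collect all values of a year, keeping first-seen order.
    static Map<Integer, List<Integer>> group(List<String> records) {
        Map<Integer, List<Integer>> byYear = new LinkedHashMap<>();
        for (String record : records) {
            int[] kv = map(record);
            byYear.computeIfAbsent(kv[0], y -> new ArrayList<>()).add(kv[1]);
        }
        return byYear;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        records.add("0067011990999991950051507+0000+");
        records.add("0043011990999991950051512+0022+");
        records.add("0043011990999991950051518-0011+");
        records.add("0043012650999991949032412+0111+");
        records.add("0043012650999991949032418+0078+");
        records.add("0067011990999991937051507+0001+");
        records.add("0043011990999991937051512-0002+");
        records.add("0043011990999991945051518+0001+");
        records.add("0043012650999991945032412+0002+");
        records.add("0043012650999991945032418+0078+");

        // prints {1950=[0, 22, -11], 1949=[111, 78], 1937=[1, -2], 1945=[1, 2, 78]}
        System.out.println(group(records));
    }
}
```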