1. What is big data
Since the beginning of this century, especially after 2010, with the development of the Internet, especially the mobile Internet, the growth of data has exploded. It has been difficult to estimate how much data is stored in electronic devices around the world, and describe the data of the data system. The unit of measurement for volume has been increasing from MB (1MB is approximately equal to one million bytes), GB (1024MB), and TB (1024GB). At present, PB (equal to 1024TB)-level data systems are already very common. The amount of data, social networking sites, scientific calculations, securities trading, website logs, and sensor networks continues to increase. The total amount of data owned in China has long exceeded the ZB (1ZB=1024EB, 1EB=1024PB) level.
The traditional method of data processing is: as the amount of data increases, the hardware indicators are constantly updated, and measures such as more powerful CPUs and larger capacity disks are adopted. But the reality is: the speed of data increase far exceeds that of a single machine. Increased speed of computing and storage capacity. The "big data" processing method is: the use of multiple machines and multiple nodes to process large amounts of data, and the use of this new processing method requires a new big data system to ensure that the system needs to handle the communication between multiple nodes A series of issues such as coordination and data separation.
In short, the use of multi-machine and multi-node methods to solve the communication coordination, data coordination, and calculation coordination problems of each node, and the way to process massive amounts of data, is the "big data" thinking. Its characteristic is that as the amount of data continues to increase, the number of machines can be increased and horizontally expanded. A big data system can reach tens of thousands of machines or more.
2. hadoop overview
Hadoop is a software platform for the development and operation of large-scale data processing. It is an open source software framework of Apache that uses Java language to implement distributed computing on massive data in a cluster of large numbers of computers.
The core design of the Hadoop framework is: HDFS and MapReduce. HDFS provides the storage of massive data, and MapReduce provides the calculation of data. Hadoop release in addition to the community hadoop the Apache, Cloudera, Hortonworks, IBM , INTEL , Huawei, large fast search, etc., provides its own commercial version. The commercial version mainly provides professional technical support, which is especially important for some large enterprises.
DK.Hadoop is a deeply integrated and recompiled version of HADOOP that can be released separately. A necessary component for independent deployment of FreeRCH (integrated development framework for big fast and big data). DK.HADOOP integrates NOSQL database, which simplifies the programming between file system and non-relational database; DK.HADOOP improves the cluster synchronization system, making HADOOP's data processing more efficient.
3. Detailed explanation of hadoop development technology
1. Hadoop operating principle
Hadoop is an open source distributed parallel programming framework that can run on large-scale clusters. Its core designs include MapReduce and HDFS. Based on Hadoop, you can easily write distributed parallel programs that can handle massive amounts of data and run them on large-scale computer clusters composed of hundreds of nodes.
It is relatively simple to write distributed parallel programs based on the MapReduce computing model. The programmer's main job is to design and implement Map and Reduce classes. Other complex issues in parallel programming, such as distributed storage, job scheduling, load balancing, fault-tolerant processing, and network Communication, etc., are handled by the MapReduce framework and the HDFS file system, so programmers don't have to worry about it at all. In other words, programmers only need to care about their own business logic, and do not need to care about the underlying communication mechanism and other issues to write complex and efficient parallel programs. If the difficulty of distributed parallel programming is enough to make ordinary programmers daunting, the emergence of open source Hadoop has greatly reduced its threshold.
2. The principle of Mapreduce
The processing process mainly involves the following four parts:
Client process: used to submit Map-reduce job job;
JobTracker process: it is a Java process, and its main class is JobTracker;
TaskTracker process: It is a Java process, and its main class is TaskTracker;
HDFS: Hadoop distributed file system, used to share Job-related files among various processes;
The JobTracker process is used as the master control to schedule and manage other TaskTracker processes. The JobTracker can run on any computer in the cluster. Generally, the JobTracker process is configured to run on the NameNode node. TaskTracker is responsible for executing the tasks assigned by the JobTracker process. It must run on the DataNode, that is, the DataNode is both a data storage node and a computing node. JobTracker distributes Map tasks and Reduce tasks to idle TaskTrackers, allows these tasks to run in parallel, and is responsible for monitoring the running status of tasks. If a TaskTracker fails, JobTracker will transfer its responsible tasks to another idle TaskTracker to run again.
3. HDFS storage mechanism
Hadoop's distributed file system HDFS is a virtual distributed file system built on the Linux file system. It consists of a management node (NameNode) and N data nodes (DataNode). Each node is an ordinary one. computer. In use, it is very similar to the file system on a single machine that we are familiar with. It can also create directories, create, copy, delete files, and view file contents. But the underlying implementation is to cut files into blocks, and then these blocks are stored on different DataNodes. Each Block can also be copied and stored on different DataNodes to achieve fault tolerance and disaster tolerance. The NameNode is the core of the entire HDFS. It maintains some data structures to record how many blocks each file is cut into, which DataNodes these blocks can be obtained from, and the status of each DataNode and other important information.
HDFS data block
Each disk has a default data block size, which is the basic unit of disk read and write. The file system built on a single disk manages the blocks in the file system through disk blocks. The blocks in the file system are generally An integer multiple of the disk block. The disk block is generally 512 bytes. HDFS also has the concept of a block, and the default is 64MB (the size of the data processed by a map). The files on HDFS are also divided into multiple blocks of block size, and other The difference in the file system is that files smaller than a block size in HDFS do not occupy the entire block space.
Task granularity-data slice (Splits)
When cutting the original large data set into small data sets, usually let the small data set be smaller than or equal to the size of a block in HDFS (the default is 64M), so as to ensure that a small data set is located on a computer, which is convenient for local calculation. When there are M small data sets to be processed, M Map tasks are started. Note that these M Map tasks are distributed on N computers and run in parallel, and the number of Reduce tasks R can be specified by the user.
The first obvious benefit brought by HDFS with block storage is that the size of a file can be larger than the capacity of any disk in the network, and data blocks can be stored on any disk in the disk. The second simplifies the design of the system and controls The unit is set to block, which simplifies storage management. It is relatively easy to calculate how many blocks a single disk can store. At the same time, it also eliminates concerns about metadata, such as permission information, which can be managed by other systems separately.
4. Give a simple example to illustrate
The operating mechanism of MapReduce
Take, for example, a program that counts the number of occurrences of each word in a text file,
<offset position of the line in the file, a line in the file>, after mapping by the Map function, a batch of intermediate results <words, number of occurrences> are formed, and the Reduce function can process the intermediate results and treat the appearance of the same word The number of times is accumulated to get the total number of appearances of each word.
5. The core process of MapReduce-Shuffle[' fl] and Sort
Shuffle is the heart of mapreduce. Understanding this process will help you write more efficient mapreduce programs and Hadoop tuning.
Shuffle refers to the process of starting from the output of the Map, including the system performing sorting and transmitting the Map output to the Reducer as input. As shown below:
1. start the analysis from the Map side. When the Map starts to produce output, he does not simply write the data to the disk, because frequent operations will cause serious performance degradation, and his processing is more complicated. The data is first written to the memory. A buffer, and do some pre-sorting to improve efficiency, as shown in the figure:
The task has a "circular memory buffer" for writing "output data". The default size of this buffer is 100M (the specific size can be set through the io.sort.mb property). When the amount of data in the buffer is When a specific threshold (io.sort.mb * io.sort.spill.percent, io.sort.spill.percent is 0.80 by default) is reached, the system will start a background thread to spill the contents of the buffer to Disk. During the spill process, the output of the Map will continue to be written to the buffer, but if the buffer is full, the Map will be blocked until the spill is completed. Before the spill thread writes the data in the buffer to the disk, it will perform a secondary sorting, first sorting according to the partition to which the data belongs, and then sorting each partition by Key. The output includes an index file and data file. If the Combiner is set, it will be performed on the basis of the sorted output. Combiner is a Mini Reducer. It runs on the node itself that executes the Map task. 1. it performs a simple Reduce on the output of the Map to make the output of the Map more compact, and less data will be written to the disk and transmitted to the Reducer. The Spill file is saved in the directory specified by mapred.local.dir and deleted after the Map task is completed. Whenever the data in the memory reaches the spill threshold, a new spill file will be generated, so when the Map task finishes writing its last output record, there may be multiple spill files. Before the Map task is completed, All spill files will be merged and sorted into one index file and data file. As shown in Figure 3. This is a multi-channel merge process, and the maximum number of merged channels is controlled by io.sort.factor (the default is 10). If the Combiner is set and the number of spill files is at least 3 (controlled by the min.num.spills.for.combine attribute), then the Combiner will run to compress the data before the output file is written to the disk.
Dakuai Big Data Platform (DKH) is a one-stop search engine-level, big data general computing platform designed by Dakuai Company to open up the channel between the big data ecosystem and traditional non-big data companies. By using DKH, traditional companies can easily bridge the technological gap of big data and achieve search engine-level big data platform performance.
l DKH effectively integrates all the components of the entire HADOOP ecosystem, and is deeply optimized and recompiled into a complete and higher-performance big data general computing platform, realizing the organic coordination of various components. Therefore, compared with the open source big data platform, DKH has a performance improvement of up to 5 times (maximum) in terms of computing performance.
l DKH, through Dakuai s unique middleware technology, simplifies the complex big data cluster configuration to three types of nodes (primary node, management node, computing node), which greatly simplifies the management, operation and maintenance of the cluster, and enhances This improves the cluster's high availability, high maintainability, and high stability.
l Although DKH has been highly integrated, it still maintains all the advantages of open source systems and is 100% compatible with open source systems. Big data applications developed based on open source platforms can run efficiently on DKH without any changes. And the performance will be improved by up to 5 times.
Technical architecture diagram of DKH standard platform