1. Apache Hadoop and the Hadoop Ecosystem
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and persistent data
storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL (and which is translated by the runtime engine to
MapReduce jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
2. map Reduce Data flow
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output
is intermediate output: it’s processed by reduce tasks to produce the final output, and
once the job is complete the map output can be thrown away. So storing it in HDFS,
with replication, would be overkill. If the node running the map task fails before the
map output has been consumed by the reduce task, then Hadoop will automatically
rerun the map task on another node to re-create the map output.
Reduce tasks don’t have the advantage of data locality—the input to a single reduce
task is normally the output from all mappers. In the present example, we have a single
reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have
to be transferred across the network to the node where the reduce task is running, where
they are merged and then passed to the user-defined reduce function. The output of
the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for
each HDFS block of the reduce output, the first replica is stored on the local node, with
other replicas being stored on off-rack nodes. Thus, writing the reduce output does
consume network bandwidth, but only as much as a normal HDFS write pipeline
consumes.
3. Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output—the combiner function’s
output forms the input to the reduce function.
4. The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware
Very large files
Streaming data access
Commodity hardware
Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
5. Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit log. The namenode also knows
the datanodes on which all the blocks for a given file are located, however, it does
not store block locations persistently, since this information is reconstructed from
datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a POSIX-like filesystem interface, so the user code does not need to know about the namenode and datanode to function.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when
they are told to (by clients or the namenode), and they report back to the namenode
periodically with lists of blocks that they are storing.
from hadoop definitive guide ch 1.2.3
댓글 없음:
댓글 쓰기