Introduction
- Why big data?
In case of Google:
Distributed computing challenge
- Scale out with distributed computing
- Volume, velocity and variety
- Recover from failures
- Shared nothing architecture (name node & job tracker)
a. Google file system (GFS) (2003), which the hadoop file system (HDFS) is based upon.
b. MapReduce (2004)
- Hadoop
- Open source
- Relax the complexities of distributed systems
- Fault-tolerant distributed file system
- API
- HDFS
- Tackle files rather than database
- Split file into chunks or blocks (64 to 128 mb each)
- Place each block of data on a different data node, and replicate each block to three nodes by default
- Read data in the same rack
- Hive query
- MapReduce
- Map
- Shuffle and sort
- Reduce