Hadoop notes I

Posted on 2016-08-08 | In Big data

Introduction

Why big data?
In the case of Google:
- ~40 billion web pages x 30 KB each ≈ 1.2 PB
- An average disk reads about 120 MB/s
- A single disk would need over 3 months (about 116 days) to read the web
- About 1,000 drives are needed to store and use the data
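
The figures above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: it assumes decimal units (1 KB = 1,000 bytes) and, for the last two numbers, that the data is spread evenly over 1,000 drives that can all be read in parallel, which the notes do not state explicitly.

```python
# Back-of-envelope estimate using the figures from the list above.
PAGES = 40e9             # ~40 billion web pages
PAGE_SIZE_BYTES = 30e3   # ~30 KB per page (decimal units assumed)
DISK_READ_BPS = 120e6    # ~120 MB/s sequential read speed
DRIVES = 1000            # assumed to be read in parallel

total_bytes = PAGES * PAGE_SIZE_BYTES
total_pb = total_bytes / 1e15

# Time for one disk to read everything, in seconds and in days.
single_disk_seconds = total_bytes / DISK_READ_BPS
single_disk_days = single_disk_seconds / 86400

# With the data spread evenly over DRIVES disks read in parallel.
parallel_hours = single_disk_seconds / DRIVES / 3600
per_drive_tb = total_bytes / DRIVES / 1e12

print(f"total size:        {total_pb:.1f} PB")
print(f"one disk:          {single_disk_days:.0f} days")
print(f"{DRIVES} drives:       {parallel_hours:.1f} hours")
print(f"data per drive:    {per_drive_tb:.1f} TB")
```

Under these assumptions the web is about 1.2 PB, one disk needs roughly 116 days to read it, and 1,000 drives reading in parallel need under 3 hours, which is the motivation for scaling out.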

Distributed computing challenge
- Scale out with distributed computing
- Volume, velocity and variety
- Recover from failures
- Shared-nothing architecture (name node & job tracker)
Google's published designs:
a. Google File System (GFS) (2003), on which the Hadoop Distributed File System (HDFS) is based.
b. MapReduce (2004)

Hadoop
- Open source
- Hides much of the complexity of distributed systems
- Fault-tolerant distributed file system
- API

HDFS
- Works with files rather than database records
- Splits each file into chunks, or blocks (64 to 128 MB each)
- Places the blocks of a file on different data nodes, and replicates each block to three nodes by default
- Reads data from a replica in the same rack when possible
- Data stored in HDFS can be queried with Hive
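
The block splitting and replication described above can be illustrated with a toy model. This is a sketch only: the function names `split_into_blocks` and `place_replicas`, the node names, and the round-robin placement are assumptions made for illustration. Real HDFS uses a rack-aware placement policy, which this code does not implement.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the upper end of the range above
REPLICATION = 3                  # default replication factor from the notes


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement (hypothetical, not the HDFS policy).

    Each block is assigned `replication` distinct data nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

for block_id, data_nodes in placement.items():
    size_mb = blocks[block_id] // (1024 * 1024)
    print(f"block {block_id} ({size_mb} MB) -> {data_nodes}")
```

Because every block lives on three different nodes, losing any single data node in this model still leaves two copies of each block it held.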

MapReduce
- Map
- Shuffle and sort
- Reduce
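
The three phases above can be sketched with a word count in plain Python. This runs in a single process and only mimics the data flow. It does not use the Hadoop API, and the function names `map_phase`, `shuffle_and_sort` and `reduce_phase` are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data node"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped)

print(counts)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves each key's values across the network to the reducer responsible for that key.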

Volcanohong
