Hadoop notes I

  • Introduction

  • Why big data?
    In Google's case:
    • ~40 billion web pages x 30 KB each ≈ 1.2 petabytes
    • an average disk reads about 120 MB/s
    • a single disk would need over 3 months to read the whole web
    • about 1,000 drives are needed to store and process it
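
The figures above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes decimal units (1 KB = 1,000 bytes), a perfectly sequential read, and an even split of the data across the drives; the variable names are illustrative.

```python
# Back-of-the-envelope check of the numbers in the notes (decimal units assumed).
PAGES = 40e9                 # ~40 billion web pages
PAGE_BYTES = 30e3            # ~30 KB per page
DISK_BYTES_PER_SEC = 120e6   # ~120 MB/s sequential read
DRIVES = 1000                # ~1,000 drives

total_bytes = PAGES * PAGE_BYTES
one_drive_seconds = total_bytes / DISK_BYTES_PER_SEC
one_drive_days = one_drive_seconds / 86400
# Assumes the data is spread evenly and every drive is read at the same time.
parallel_hours = one_drive_seconds / DRIVES / 3600

print(f"total size: {total_bytes / 1e15:.1f} PB")
print(f"one drive: {one_drive_days:.0f} days")
print(f"{DRIVES} drives in parallel: {parallel_hours:.1f} hours")
```

Under these assumptions one drive needs about 116 days, while 1,000 drives reading in parallel need under 3 hours, which is the motivation for scaling out.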
  • Distributed computing challenge

    • Scale out with distributed computing
    • Volume, velocity and variety
    • Recover from failures
    • Shared-nothing architecture, coordinated by master services (NameNode and JobTracker)

    a. Google File System (GFS) (2003), on which the Hadoop Distributed File System (HDFS) is based
    b. MapReduce (2004)

  • Hadoop
    • Open source
    • Hide the complexities of distributed systems
    • Fault-tolerant distributed file system
    • API
  • HDFS
    • Store files rather than database records
    • Split each file into chunks, or blocks (64 to 128 MB each)
    • Place the blocks of a file on different DataNodes, and replicate each block to three nodes by default
    • Prefer reading data from a replica in the same rack
    • Hive query
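
The block splitting and replication described above can be illustrated with a toy model. This is a sketch only, not the real HDFS placement policy (which is rack-aware): the round-robin assignment, the node names `dn1` to `dn4`, and the 300 MiB file are all assumptions made for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the upper end of the range in the notes
REPLICATION = 3                 # default replication factor from the notes

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block; only the last one may be short."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes.

    Real HDFS also takes racks into account; that is deliberately left out here.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)  # hypothetical 300 MiB file
nodes = ["dn1", "dn2", "dn3", "dn4"]           # hypothetical DataNode names
placement = place_replicas(len(blocks), nodes)
for block_id, replicas in placement.items():
    print(block_id, blocks[block_id], replicas)
```

With these inputs the file becomes two full 128 MiB blocks plus one 44 MiB block, and each block lands on three different nodes, so losing any single node still leaves two copies of every block.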
  • MapReduce
    • Map
    • Shuffle and sort
    • Reduce
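
The three phases can be sketched as a single-process word count. This is plain Python rather than the Hadoop API, so it only shows the data flow; the function names and sample lines are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped)
print(counts)
```

In a real cluster the map calls would run on the nodes holding the input blocks and the shuffle would move pairs across the network so that each reducer sees all values for its keys; here everything runs in one process to keep the flow visible.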