1. Meet Hadoop

      Data

      Data Storage and Analysis

      Comparison with Other Systems

      RDBMS

      Grid Computing

      Volunteer Computing

      A Brief History of Hadoop

      Apache Hadoop and the Hadoop Ecosystem

      Hadoop Releases

2. MapReduce

   A Weather Dataset

   Data Format

   Analyzing the Data with Unix Tools

   Analyzing the Data with Hadoop

   Map and Reduce

   Java MapReduce

   Scaling Out

   Data Flow

   Combiner Functions

   Running a Distributed MapReduce Job

   Hadoop Streaming

   Compiling and Running

3. The Hadoop Distributed File System (HDFS)

   The Design of HDFS

   HDFS Concepts

   Blocks

   Namenodes and Datanodes

   HDFS Federation

   HDFS High-Availability

   The Command-Line Interface

   Basic Filesystem Operations

   Hadoop Filesystems

   Interfaces

   The Java Interface

   Reading Data from a Hadoop URL

   Reading Data Using the FileSystem API

   Writing Data

   Directories

   Querying the Filesystem

   Deleting Data

   Data Flow

   Anatomy of a File Read

   Anatomy of a File Write

   Coherency Model

   Parallel Copying with distcp

   Keeping an HDFS Cluster Balanced

   Hadoop Archives

4. Hadoop I/O

   Data Integrity

   Data Integrity in HDFS

   LocalFileSystem

   ChecksumFileSystem

   Compression

   Codecs

   Compression and Input Splits

   Using Compression in MapReduce

   Serialization

   The Writable Interface

   Writable Classes

   File-Based Data Structures

   SequenceFile

   MapFile

5. Developing a MapReduce Application

   The Configuration API

   Combining Resources

   Variable Expansion

   Configuring the Development Environment

   Managing Configuration

   GenericOptionsParser, Tool, and ToolRunner

   Writing a Unit Test

   Mapper

   Reducer

   Running Locally on Test Data

   Running a Job in a Local Job Runner

   Testing the Driver

   Running on a Cluster

   Packaging

   Launching a Job

   The MapReduce Web UI

   Retrieving the Results

   Debugging a Job

   Hadoop Logs

   Tuning a Job

   Profiling Tasks

   MapReduce Workflows

   Decomposing a Problem into MapReduce Jobs

   JobControl

6. How MapReduce Works

   Anatomy of a MapReduce Job Run

   Classic MapReduce (MapReduce 1)

   Failures

   Failures in Classic MapReduce

   Failures in YARN

   Job Scheduling

   The Capacity Scheduler

   Shuffle and Sort

   The Map Side

   The Reduce Side

   Configuration Tuning

   Task Execution

   The Task Execution Environment

   Speculative Execution

   Output Committers

   Task JVM Reuse

   Skipping Bad Records

7. MapReduce Types and Formats

   MapReduce Types

   The Default MapReduce Job

   Input Formats

   Input Splits and Records

   Text Input

   Binary Input

   Multiple Inputs

   Database Input (and Output)

   Output Formats

   Text Output

   Binary Output

   Multiple Outputs

   Lazy Output

   Database Output

8. MapReduce Features

   Counters

   Built-in Counters

   User-Defined Java Counters

   User-Defined Streaming Counters

   Sorting

   Preparation

   Partial Sort

   Total Sort

   Secondary Sort

   Joins

   Map-Side Joins

   Reduce-Side Joins

   Side Data Distribution

   Using the Job Configuration

   Distributed Cache

   MapReduce Library Classes

9. Setting Up a Hadoop Cluster

   Cluster Specification

   Network Topology

   Cluster Setup and Installation

   Installing Java

   Creating a Hadoop User

   Installing Hadoop

   Testing the Installation

   SSH Configuration

   Hadoop Configuration

   Configuration Management

   Environment Settings

   Important Hadoop Daemon Properties

   Hadoop Daemon Addresses and Ports

   Other Hadoop Properties

   User Account Creation

   YARN Configuration

   Important YARN Daemon Properties

   YARN Daemon Addresses and Ports

   Security

   Kerberos and Hadoop

   Delegation Tokens

   Other Security Enhancements

   Benchmarking a Hadoop Cluster

   Hadoop Benchmarks

   User Jobs

   Hadoop in the Cloud

   Hadoop on Amazon EC2

10. Administering Hadoop

   HDFS

   Persistent Data Structures

   Safe Mode

   Audit Logging

   Tools

   Monitoring

   Logging

   Metrics

   Java Management Extensions

   Routine Administration Procedures

   Commissioning and Decommissioning Nodes

   Upgrades

11. Pig

   Installing and Running Pig

   Execution Types

   Running Pig Programs

   Grunt

   Pig Latin Editors

   An Example

   Generating Examples

   Comparison with Databases

   Pig Latin

   Structure

   Statements

   Expressions

   Types

   Schemas

   Functions

   Macros

   User-Defined Functions

   A Filter UDF

   An Eval UDF

   A Load UDF

   Data Processing Operators

   Loading and Storing Data

   Filtering Data

   Grouping and Joining Data

   Sorting Data

   Combining and Splitting Data

   Pig in Practice

   Parallelism

   Parameter Substitution

12. Hive

   Installing Hive

   The Hive Shell

   An Example

   Running Hive

   Configuring Hive

   Hive Services

   Comparison with Traditional Databases

   Schema on Read Versus Schema on Write

   Updates, Transactions, and Indexes

   HiveQL

   Data Types

   Operators and Functions

   Tables

   Managed Tables and External Tables

   Partitions and Buckets

   Storage Formats

   Importing Data

   Altering Tables

   Dropping Tables

   Querying Data

   Sorting and Aggregating

   MapReduce Scripts

   Joins

   Subqueries

   Views

   User-Defined Functions

   Writing a UDF

   Writing a UDAF

13. HBase

   Backdrop

   Concepts

   Whirlwind Tour of the Data Model

   Implementation

   Installation

   Test Drive

   Clients

   Java

   Avro, REST, and Thrift

   Schemas

   Loading Data

   Web Queries

   HBase Versus RDBMS

   Successful Service

   HBase

14. ZooKeeper

   Installing and Running ZooKeeper

   Group Membership in ZooKeeper

   Creating the Group

   Joining a Group

   Listing Members in a Group

   Deleting a Group

   The ZooKeeper Service

   Data Model

   Operations

   Implementation

   Consistency

   Sessions

   States

15. Sqoop

   Getting Sqoop

   A Sample Import

   Generated Code

   Additional Serialization Systems

   Database Imports: A Deeper Look

   Controlling the Import

   Imports and Consistency

   Direct-mode Imports

   Working with Imported Data

   Imported Data and Hive

   Importing Large Objects

16. Flume

   Introduction

   Overview

   Architecture

   Data Flow Model

   Reliability

   Building Flume

   Getting the Source

   Compiling and Testing Flume

   Developing Custom Components

   Client

   Client SDK

   RPC Client Interface

   RPC Clients: Avro and Thrift

   Failover Client

   Load-Balancing RPC Client

   Embedded Agent

   Transaction Interface

   Sink

   Source

   Channel