Hadoop Streaming
Welcome to this comprehensive, student-friendly guide on Hadoop Streaming! 🎉 If you’re new to Hadoop or just looking to understand how to process large datasets using simple scripts, you’re in the right place. Don’t worry if this seems complex at first; we’ll break it down step by step. By the end of this tutorial, you’ll be able to run your own Hadoop Streaming jobs with confidence!
What You’ll Learn 📚
- Introduction to Hadoop Streaming
- Core concepts and terminology
- Simple and progressively complex examples
- Common questions and answers
- Troubleshooting common issues
Introduction to Hadoop Streaming
Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. This means you can use languages like Python, Perl, or even bash scripts to process your data. It’s a flexible way to harness the power of Hadoop without needing to write Java code.
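For example, the mapper and reducer can be ordinary Unix programs. A classic illustration from the Hadoop documentation uses cat as the mapper and wc as the reducer (the jar path and HDFS paths below are placeholders; adjust them for your installation):

hadoop jar /path/to/hadoop-streaming.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Each mapper streams its input split line by line through cat, and each reducer pipes the sorted key-value pairs through wc.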
Key Terminology
- Mapper: A function that processes input data and emits key-value pairs.
- Reducer: A function that processes key-value pairs emitted by the mapper and produces the final output.
- HDFS: Hadoop Distributed File System, where data is stored in a distributed manner.
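The glue between these pieces is plain text: by default, Hadoop Streaming treats everything up to the first tab character on a line as the key and the rest as the value, and it sorts the mapper output by key before the reducer sees it. So mapper output such as

hello	1
world	1
hello	1

reaches the reducer grouped by key:

hello	1
hello	1
world	1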
Getting Started with a Simple Example 🚀
Example 1: Word Count with Python
Let’s start with the classic word count example. We’ll use Python scripts for the mapper and reducer.
Setup Instructions
- Ensure Hadoop is installed and running on your system.
- Prepare a text file in HDFS. For example, upload a file named input.txt containing some text:
hadoop fs -put input.txt /user/hadoop/input
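To confirm the upload, list the directory:

hadoop fs -ls /user/hadoop/input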
Mapper Script (mapper.py)
#!/usr/bin/env python3
import sys

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Split the line into words
    words = line.split()
    # Emit each word with a count of 1
    for word in words:
        # Write the results to standard output (stdout)
        print(f'{word}\t1')
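Because the mapper is just a program that reads stdin, you can test it locally before involving Hadoop at all. For example:

echo "foo foo quux labs foo bar quux" | python3 mapper.py

This should print each word with a count of 1, one tab-separated pair per line.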
Reducer Script (reducer.py)
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0
word = None

# Input comes from standard input
for line in sys.stdin:
    # Remove leading and trailing whitespace
    line = line.strip()
    # Parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # Convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # Count was not a number, so silently ignore/discard this line
        continue
    # This IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write result to standard output
            print(f'{current_word}\t{current_count}')
        current_count = count
        current_word = word

# Do not forget to output the last word if needed!
if current_word == word:
    print(f'{current_word}\t{current_count}')
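You can also chain both scripts locally, using sort to stand in for Hadoop's shuffle-and-sort phase:

echo "foo foo quux labs foo bar quux" | python3 mapper.py | sort | python3 reducer.py

Expected output:

bar	1
foo	3
labs	1
quux	2

If the local pipeline looks right, the cluster job should produce the same results.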
Running the Hadoop Streaming Job
hadoop jar /path/to/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -input /user/hadoop/input/input.txt \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py

Note the -files option: it ships your local scripts to the worker nodes so the tasks can find them.
Expected Output: A directory /user/hadoop/output containing the word count results.
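To inspect the results, read the part-* files that the reducers write inside the output directory:

hadoop fs -cat /user/hadoop/output/part-*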
Progressively Complex Examples
Example 2: Filtering Data
Let’s filter lines containing a specific word using a Python script.
Mapper Script (filter_mapper.py)
#!/usr/bin/env python3
import sys

# Emit only lines that contain the target word
for line in sys.stdin:
    line = line.strip()
    if 'specific_word' in line:
        print(line)
Running the Job
hadoop jar /path/to/hadoop-streaming.jar \
    -files filter_mapper.py \
    -input /user/hadoop/input/input.txt \
    -output /user/hadoop/output_filter \
    -mapper filter_mapper.py \
    -reducer /bin/cat
Expected Output: Lines containing ‘specific_word’ in the output directory.
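Since the reducer here (/bin/cat) only passes data through, a filtering job like this can also run map-only by setting the number of reducers to zero, which skips the shuffle entirely. A sketch of that variant:

hadoop jar /path/to/hadoop-streaming.jar \
    -files filter_mapper.py \
    -input /user/hadoop/input/input.txt \
    -output /user/hadoop/output_filter \
    -mapper filter_mapper.py \
    -numReduceTasks 0

With zero reducers, the mapper output is written to the output directory directly.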
Example 3: Data Aggregation
Aggregate numerical data using a Python reducer.
Mapper Script (sum_mapper.py)
#!/usr/bin/env python3
import sys

# Emit every number under a single constant key so the
# reducer receives all of them together
for line in sys.stdin:
    line = line.strip()
    try:
        number = int(line)
        print(f'number\t{number}')
    except ValueError:
        continue
Reducer Script (sum_reducer.py)
#!/usr/bin/env python3
import sys

total_sum = 0
for line in sys.stdin:
    line = line.strip()
    # Discard the constant key; keep the value
    _, number = line.split('\t')
    total_sum += int(number)
print(f'Total Sum: {total_sum}')
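Before submitting, a quick local sanity check (seq generates the numbers 1 through 10, so the expected output is Total Sum: 55):

seq 1 10 | python3 sum_mapper.py | sort | python3 sum_reducer.py

Note that this reducer prints one grand total, which works because every pair shares the key 'number' and therefore lands on a single reducer; any extra reducer tasks would receive no input and simply print Total Sum: 0.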
Running the Job
hadoop jar /path/to/hadoop-streaming.jar \
    -files sum_mapper.py,sum_reducer.py \
    -input /user/hadoop/input/numbers.txt \
    -output /user/hadoop/output_sum \
    -mapper sum_mapper.py \
    -reducer sum_reducer.py
Expected Output: Total sum of numbers in the output directory.
Common Questions and Answers 🤔
- What is Hadoop Streaming?
Hadoop Streaming is a utility that allows you to run MapReduce jobs with any script or executable as the mapper and reducer.
- Why use Hadoop Streaming?
It provides flexibility to use languages other than Java, making it easier for those familiar with scripting languages to process data.
- What languages can I use with Hadoop Streaming?
You can use any language that can read from standard input and write to standard output, such as Python, Perl, or bash.
- How do I handle errors in my scripts?
Ensure your scripts handle exceptions and edge cases gracefully, and use logging to debug issues.
- Can I use multiple reducers?
Yes, you can specify the number of reducers with the -numReduceTasks option, as shown in the example below.
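For instance, a variant of the word count job from Example 1 with four reducers (each reducer writes its own part-* file, so the output directory will contain four result files):

hadoop jar /path/to/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -input /user/hadoop/input/input.txt \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -numReduceTasks 4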
Troubleshooting Common Issues 🛠️
Issue: Job fails with ‘File Not Found’ error.
Solution: Ensure that the input file path in HDFS is correct and accessible.
Issue: Mapper or reducer script not found.
Solution: Ship the scripts to the worker nodes with the -files option (as in the examples above), double-check the script paths, and make the scripts executable with chmod +x script.py.
Issue: Output directory already exists.
Solution: Hadoop will not overwrite an existing output directory. Remove it with hadoop fs -rm -r /user/hadoop/output before running the job again.
Practice Exercises 🏋️‍♂️
- Exercise 1: Modify the word count example to ignore case sensitivity.
- Exercise 2: Create a mapper that filters out lines shorter than 5 characters.
- Exercise 3: Write a reducer that calculates the average of numbers.
Try these exercises to reinforce your understanding and don’t hesitate to experiment with different scripts and data!
Keep practicing and exploring, and you’ll become proficient in Hadoop Streaming in no time! 🚀