mrjob is the easiest way to write Python programs that can run on Hadoop. Its most prominent feature is that mrjob does not require you to install Hadoop or deploy a cluster: you can run the code on your local machine for testing. mrjob can also easily run on Amazon Elastic MapReduce.
Since Hadoop has no native Python API, Python MapReduce jobs have to go through Hadoop Streaming, and mrjob is a convenient wrapper around it.
Installation
If you have not installed pip3 on your Linux machine yet, you can install it with:
    sudo apt update && sudo apt install python3-pip
After that, you can install mrjob using pip3:
    pip3 install mrjob
To check that the installation succeeded, start a python3 interpreter and import the package:
    ➜ ~ > python3
    >>> import mrjob
If the import produces no output, the installation was successful.
mrjob example
We use the classic word count example to demonstrate mrjob. A minimal version of such a job, saved as word_count.py, looks like this:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        # mapper() is called once per input line; the key is unused here
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # reducer() receives a word and an iterator over all of its counts
        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()
Both mapper() and reducer() use yield. A function that contains yield returns an object of type "generator", and calling next() on that object produces the values of the sequence one at a time.
Put briefly: yield returns a value just like return does, but it also remembers the position where it returned, so the next iteration resumes right after that point (from the next line).
For example:
    ➜ ~ > python3
    >>> def gen():
    ...     yield 1
    ...     yield 2
    ...
    >>> g = gen()
    >>> type(g)
    <class 'generator'>
    >>> next(g)
    1
    >>> next(g)
    2
Run word_count.py in the local environment:
    python3 word_count.py input.txt
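mrjob writes the reducer output to stdout as tab-separated, JSON-encoded key/value pairs. As a sketch, assuming a hypothetical input.txt containing the single line "hello world hello", the output would be:

    "hello"	2
    "world"	1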
MapReduce
MapReduce is a system for processing large amounts of data on distributed systems. It is based on the paper MapReduce: Simplified Data Processing on Large Clusters. MapReduce divides massive data into small data sets, performs the same task on them in parallel, and finally collates and merges all the sub-results into the final result. The step that splits the data and applies the same task to each piece is called the mapper, and the later step that merges and organizes the sub-results is called the reducer. A combiner can be seen as an optimization between the two, but it is optional; a sketch of it follows.
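To illustrate where the combiner fits, here is the word count job from above with a combiner added (combiner() is part of mrjob's standard job interface; the class name is our own):

    from mrjob.job import MRJob

    class MRWordCountWithCombiner(MRJob):

        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # the combiner pre-aggregates mapper output on the same node,
        # reducing the amount of data shuffled to the reducers
        def combiner(self, word, counts):
            yield word, sum(counts)

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCountWithCombiner.run()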
When running mrjob jobs on the Hadoop framework, we first need to configure Hadoop, and YARN must be running before the job is launched.
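Assuming a standard Hadoop installation with its sbin directory on the PATH, the daemons can be started with the stock scripts:

    # start the HDFS and YARN daemons
    start-dfs.sh
    start-yarn.sh
    # jps should now list ResourceManager and NodeManager among the Java processes
    jps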
Configuration of mapred-site.xml:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
Configuration of yarn-site.xml:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
Executing mrjob on Hadoop
Using the same word count code, we can run the job on Hadoop with:
    python3 word_count.py -r hadoop input.txt
mrjob can also read the input file directly from HDFS:
    python3 word_count.py -r hadoop hdfs:///input/input.txt
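If the input file is not on HDFS yet, it can be uploaded first with the standard hdfs dfs commands (the /input directory here is just an example path):

    # create the directory and copy the local file into HDFS
    hdfs dfs -mkdir -p /input
    hdfs dfs -put input.txt /input/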