mrjob is the easiest way to write Python programs that can run on Hadoop. Its most prominent feature is that mrjob does not require you to install Hadoop or deploy a cluster: you can run the code on your local machine for testing. mrjob can also easily run on Amazon Elastic MapReduce.
Since Hadoop has no native Python API, Python MapReduce jobs have to go through Hadoop Streaming, and mrjob is a convenient wrapper around it.
Installation
If you have not installed pip3 on your Linux machine yet, you can install it with:
    sudo apt update && sudo apt install python3-pip
After that, you can install mrjob using pip3:
    pip3 install mrjob
To check that the installation succeeded, start a python3 interpreter and import the package:
    ➜ ~ > python3
    >>> import mrjob
If the import produces no output, the installation was successful.
mrjob example
We use the classic word count example to demonstrate mrjob. A minimal version of such a job, saved as word_count.py, looks like this:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):

        # mapper() is called once per input line; the key is unused here
        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # reducer() receives a word and an iterator over all of its counts
        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()
Both mapper() and reducer() use yield. A function that contains yield returns an object of type "generator", and calling next() on that object produces the values of the sequence one at a time.
Put briefly: yield returns a value just like return does, but it also remembers the position where it returned, so the next iteration resumes right after that point (from the next line).
For example:
    ➜ ~ > python3
    >>> def gen():
    ...     yield 1
    ...     yield 2
    ...
    >>> g = gen()
    >>> type(g)
    <class 'generator'>
    >>> next(g)
    1
    >>> next(g)
    2
Run word_count.py in the local environment:
    python3 word_count.py input.txt
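mrjob writes the reducer output to stdout as tab-separated, JSON-encoded key/value pairs. As a sketch, assuming a hypothetical input.txt containing the single line "hello world hello", the output would be:

    "hello"	2
    "world"	1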
MapReduce
MapReduce is a system for processing large amounts of data on distributed systems. It is based on the paper MapReduce: Simplified Data Processing on Large Clusters. MapReduce divides massive data into small data sets, performs the same task on them in parallel, and finally collates and merges all the sub-results into the final result. The step that splits the data and applies the same task to each piece is called the mapper, and the later step that merges and organizes the sub-results is called the reducer. A combiner can be seen as an optimization between the two, but it is optional; a sketch of it follows.
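To illustrate where the combiner fits, here is the word count job from above with a combiner added (combiner() is part of mrjob's standard job interface; the class name is our own):

    from mrjob.job import MRJob

    class MRWordCountWithCombiner(MRJob):

        def mapper(self, _, line):
            for word in line.split():
                yield word, 1

        # the combiner pre-aggregates mapper output on the same node,
        # reducing the amount of data shuffled to the reducers
        def combiner(self, word, counts):
            yield word, sum(counts)

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCountWithCombiner.run()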
When running mrjob jobs on the Hadoop framework, we first need to configure Hadoop, and YARN must be running before the job is launched.
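Assuming a standard Hadoop installation with its sbin directory on the PATH, the daemons can be started with the stock scripts:

    # start the HDFS and YARN daemons
    start-dfs.sh
    start-yarn.sh
    # jps should now list ResourceManager and NodeManager among the Java processes
    jps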
Configuration of mapred-site.xml:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
Configuration of yarn-site.xml:

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
Executing mrjob on Hadoop
Using the same word count code, we can run the job on Hadoop with:
    python3 word_count.py -r hadoop input.txt
mrjob can also read the input file directly from HDFS:
    python3 word_count.py -r hadoop hdfs:///input/input.txt
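If the input file is not on HDFS yet, it can be uploaded first with the standard hdfs dfs commands (the /input directory here is just an example path):

    # create the directory and copy the local file into HDFS
    hdfs dfs -mkdir -p /input
    hdfs dfs -put input.txt /input/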