Walk-through
The WordCount application is quite straightforward.
The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into whitespace-separated tokens via the StringTokenizer and emits a key-value pair of <word, 1> for each token.
For the given sample input the first map emits: < Hello, 1> < World, 1> < Bye, 1> < World, 1>
The second map emits: < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
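The tokenizing step can be tried outside Hadoop. The following is a minimal plain-Java sketch, not the tutorial's Mapper: the class name MapStepSketch and its map helper are hypothetical, and the two input lines are assumed from the pairs listed above. It only illustrates how StringTokenizer yields one <word, 1> pair per token.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical illustration class; it uses no Hadoop types.
class MapStepSketch {
    // Splits one input line on whitespace and emits a <word, 1> pair per token,
    // preserving the order in which the tokens appear.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            emitted.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Assumed sample lines, inferred from the emitted pairs in the text.
        System.out.println(map("Hello World Bye World"));
        System.out.println(map("Hello Hadoop Goodbye Hadoop"));
    }
}
```

Duplicate words are emitted once per occurrence at this stage; no counting happens until the combiner or reducer runs.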
We’ll learn more about the number of maps spawned for a given job, and how to control them in a fine-grained manner, a bit later in the tutorial.
WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
The output of the first map: < Bye, 1> < Hello, 1> < World, 2>
The output of the second map: < Goodbye, 1> < Hadoop, 2> < Hello, 1>
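The effect of local aggregation can be approximated in plain Java. This is a sketch under stated assumptions: CombinerSketch and combine are hypothetical names, a TreeMap stands in for Hadoop's sort on keys, and each word in the input list represents one <word, 1> pair emitted by a single map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it uses no Hadoop types.
class CombinerSketch {
    // Sums the implicit 1 carried by each emitted word.
    // TreeMap keeps the keys in sorted order, mimicking the sort on keys.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : emittedWords) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Words emitted by each map, taken from the sample in the text.
        System.out.println(combine(Arrays.asList("Hello", "World", "Bye", "World")));
        System.out.println(combine(Arrays.asList("Hello", "Hadoop", "Goodbye", "Hadoop")));
    }
}
```

Reusing the reduce logic as the combiner is safe here because summation is associative and commutative, so partial sums computed per map can be summed again later without changing the result.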
The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (that is, words in this example).
Thus the output of the job is: < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
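The final summation can be sketched in plain Java as well. The names ReduceStepSketch, reduce, run, and sampleOutputs are hypothetical, and the grouping loop is only a stand-in for what the framework does before calling the Reducer; the per-map counts are the ones listed in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it uses no Hadoop types.
class ReduceStepSketch {
    // Sums all counts received for a single key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Groups the per-map outputs by key, then reduces each group.
    static Map<String, Integer> run(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            for (Map.Entry<String, Integer> entry : output.entrySet()) {
                grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                       .add(entry.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    // The two locally aggregated map outputs from the sample in the text.
    static List<Map<String, Integer>> sampleOutputs() {
        Map<String, Integer> first = new TreeMap<>();
        first.put("Bye", 1);
        first.put("Hello", 1);
        first.put("World", 2);
        Map<String, Integer> second = new TreeMap<>();
        second.put("Goodbye", 1);
        second.put("Hadoop", 2);
        second.put("Hello", 1);
        return Arrays.asList(first, second);
    }

    public static void main(String[] args) {
        System.out.println(run(sampleOutputs()));
    }
}
```

Only Hello appears in both map outputs, so it is the only key whose reduce call receives more than one value.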
The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key-value types, and input/output formats, in the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.
We’ll learn more about JobConf, JobClient, Tool, and other interfaces and classes a bit later in the tutorial.