Thursday, August 28, 2014

Mahout K-means on Reuter Example


# For the Reuters example
#~~~~~~~~~~~~~~~~~~~~~~~~

# Get the data first. I place it within the examples folder under the
# Mahout home directory: mahout-0.5-cdh3u5/examples/reuters


mkdir reuters
cd reuters

wget
http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz


mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
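
# Quick sanity check (optional): the archive should expand into 22 SGML
# files, reut2-000.sgm through reut2-021.sgm, plus a few README/format
# text files.
ls reuters-out/*.sgm | wc -l   # expect 22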
# Mahout steps

# (1) For the Reuters example, the original downloaded files are in SGML
format, which is similar to XML. So we first need to parse (i.e.,
preprocess) those files into document-id and document-text pairs. After
that we can convert the files into SequenceFiles, where the key is the
document id and the value is the document content. This step is done
using 'seqdirectory'. Then 'seq2sparse' converts the id-text data into
tf-idf vectors (Vector Space Model: VSM).
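
# To summarize, the pipeline we are about to run looks like this:
# raw *.sgm files  --(ExtractReuters)-->  plain-text files
# text files       --(seqdirectory)--->   SequenceFile<doc-id, doc-text>
# SequenceFile     --(seq2sparse)---->    tf and tf-idf vectors (VSM)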


# For the first preprocessing job, a much quicker way is to reuse the
Reuters parser provided in the Lucene benchmark JAR file. Because it's
bundled along with Mahout, all you need to do is change to the examples/
directory under the Mahout source tree and run the
org.apache.lucene.benchmark.utils.ExtractReuters class.
<http://manning.com/owen/MiA_SampleCh08.pdf>



# From the SGM files, generate plain text files; note that the generated
files reside on the local filesystem



${MAHOUT_HOME}/bin/mahout
org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text



hadoop fs -copyFromLocal ./reuters-text/
/your-hdfs-path-to/reuters-text
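
# Verify the upload before moving on (path is whatever you chose above)
hadoop fs -ls /your-hdfs-path-to/reuters-text | head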


# Then generate sequence-file

mahout-0.5-cdh3u5:$ ./bin/mahout seqdirectory -i
/your-hdfs-path-to/reuters-text -o /your-hdfs-path-to/reuters-seqfiles -c UTF-8
-chunk 5 


# Check the generated sequence-file

mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s
/your-hdfs-path-to/reuters-seqfiles/chunk-0 |less
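
# If everything worked, seqdumper prints Text keys (the source file
# names produced by ExtractReuters) and Text values (the article
# bodies). The output should look roughly like this sketch (exact key
# paths and text depend on your layout):
#
# Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
# Key: /reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79 ...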


# From sequence-file generate vector file

mahout-0.5-cdh3u5:$ ./bin/mahout seq2sparse -i
/your-hdfs-path-to/reuters-seqfiles/ -o /your-hdfs-path-to/reuters-vectors
-Dmapred.job.queue.name=your-queue-name
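
# seq2sparse has several knobs worth knowing about; the defaults are
# fine here, but options like the following control vectorization
# (names as of Mahout 0.5 -- confirm with ./bin/mahout seq2sparse --help):
#   -wt tfidf   term weighting (tf or tfidf)
#   -md 1       minimum document frequency for a term to be kept
#   -x 99       maximum document-frequency percentage (drops stopword-like terms)
#   -ng 2       maximum n-gram size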


# Take a look at the output directory; it should contain 7 items:
#reuters-vectors/df-count
#reuters-vectors/dictionary.file-0
#reuters-vectors/frequency.file-0
#reuters-vectors/tf-vectors
#reuters-vectors/tfidf-vectors
#reuters-vectors/tokenized-documents
#reuters-vectors/wordcount
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors
# check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors/tf-vectors
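
# You can also dump a few vectors themselves; each record is a document
# key mapped to a sparse list of termId:weight entries (the part file
# name may differ on your cluster):
mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000 | less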
# Run kmeans

mahout-0.5-cdh3u5:$ ./bin/mahout kmeans -i
reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd
0.1 -k 20 -x 10 -ow
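
# For reference, the flags used above (as of Mahout 0.5):
#   -i   input vector directory (the tf-idf vectors)
#   -o   output directory for the cluster iterations
#   -c   path for the initial centroids; because -k is given, k random
#        vectors are sampled and written here first
#   -cd  convergence delta: stop when centroids move less than this
#   -k   number of clusters
#   -x   maximum number of iterations
#   -ow  overwrite the output directory if it already exists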


# Check the cluster output

#
http://stackoverflow.com/questions/5805225/interpreting-output-from-mahout-clusterdumper



mahout clusterdump -s mahout-clusters/clusters-* -d
reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20 -o
./cluster-output.txt
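
# The dump lists one block per cluster. A rough sketch of what to
# expect in cluster-output.txt (ids, sizes, and terms vary per run;
# VL- means the cluster converged, CL- means it did not):
#
# VL-12345{n=420 c=[oil:0.281, opec:0.214, ...] r=[...]}
#   Top Terms:
#     oil     => 7.432
#     opec    => 5.118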


# Some other tips
You can set any other Hadoop property by doing:

mahout <options> -D<hadoop_property>=<value>

Replace <hadoop_property> with the property name you want to define and
<value> with the value you want.
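
# For example, to run the vectorization step on a specific queue with
# more reducers (property names are the pre-YARN mapred.* ones that
# this Mahout/CDH3 vintage expects):

./bin/mahout seq2sparse -i /your-hdfs-path-to/reuters-seqfiles \
  -o /your-hdfs-path-to/reuters-vectors \
  -Dmapred.job.queue.name=your-queue-name \
  -Dmapred.reduce.tasks=10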



In cases where you hit an "OutOfMemoryError" or "GC overhead limit
exceeded", it may help to add the following parameters to your Mahout
job:

-Dmapred.child.ulimit=4718592
(required in order to change the memory heap allocations for either the
map or the reduce phase)

-Dmapred.map.child.java.opts=-Xmx3g
(recommended extension of memory; up to 4g is acceptable)

-Dmapred.reduce.child.java.opts=-Xmx3g
(recommended extension of memory; up to 4g is acceptable)

-Dmapred.child.java.opts=-Xmx3g
(recommended extension of memory; up to 4g is acceptable; using this
will overwrite the map and reduce memory allocations with the one
specified here)
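
# Putting it together: a memory-bumped k-means run might look like the
# sketch below (same hypothetical HDFS paths as above; adjust to taste).

./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -o mahout-clusters -c mahout-initial-centers \
  -cd 0.1 -k 20 -x 10 -ow \
  -Dmapred.child.java.opts=-Xmx3g \
  -Dmapred.child.ulimit=4718592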
