Thursday, August 28, 2014

Mahout K-means on Reuter Example


# For the Reuters example
#~~~~~~~~~~~~~~~~~~~~~~~~

# Get the data first. I place it within the examples folder under the
# Mahout home directory: mahout-0.5-cdh3u5/examples/reuters


mkdir reuters
cd reuters

wget
http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz


mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
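
# Quick sanity check (optional): the archive should expand into 22 SGML
# files, reut2-000.sgm through reut2-021.sgm, plus a few README/format
# text files.
ls reuters-out/*.sgm | wc -l   # expect 22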
# Mahout steps

# (1) For the Reuters example, the original downloaded files are in SGML
format, which is similar to XML. So we first need to parse (i.e.,
preprocess) those files into document-id and document-text pairs. After
that we can convert the files into SequenceFiles, where the key is the
document id and the value is the document content. This step is done
using 'seqdirectory'. Then 'seq2sparse' converts the id-text data into
tf-idf vectors (Vector Space Model: VSM).
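
# To summarize, the pipeline we are about to run looks like this:
# raw *.sgm files  --(ExtractReuters)-->  plain-text files
# text files       --(seqdirectory)--->   SequenceFile<doc-id, doc-text>
# SequenceFile     --(seq2sparse)---->    tf and tf-idf vectors (VSM)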


# For the first preprocessing job, a much quicker way is to reuse the
Reuters parser provided in the Lucene benchmark JAR file. Because it's
bundled along with Mahout, all you need to do is change to the examples/
directory under the Mahout source tree and run the
org.apache.lucene.benchmark.utils.ExtractReuters class.
<http://manning.com/owen/MiA_SampleCh08.pdf>



# From the SGM files, generate plain text files; note that the generated
files reside on the local filesystem



${MAHOUT_HOME}/bin/mahout
org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text



hadoop fs -copyFromLocal ./reuters-text/
/your-hdfs-path-to/reuters-text
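
# Verify the upload before moving on (path is whatever you chose above)
hadoop fs -ls /your-hdfs-path-to/reuters-text | head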


# Then generate sequence-file

mahout-0.5-cdh3u5:$ ./bin/mahout seqdirectory -i
/your-hdfs-path-to/reuters-text -o /your-hdfs-path-to/reuters-seqfiles -c UTF-8
-chunk 5 


# Check the generated sequence-file

mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s
/your-hdfs-path-to/reuters-seqfiles/chunk-0 |less
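
# If everything worked, seqdumper prints Text keys (the source file
# names produced by ExtractReuters) and Text values (the article
# bodies). The output should look roughly like this sketch (exact key
# paths and text depend on your layout):
#
# Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.Text
# Key: /reut2-000.sgm-0.txt: Value: 26-FEB-1987 15:01:01.79 ...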


# From sequence-file generate vector file

mahout-0.5-cdh3u5:$ ./bin/mahout seq2sparse -i
/your-hdfs-path-to/reuters-seqfiles/ -o /your-hdfs-path-to/reuters-vectors
-Dmapred.job.queue.name=your-queue-name
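
# seq2sparse has several knobs worth knowing about; the defaults are
# fine here, but options like the following control vectorization
# (names as of Mahout 0.5 -- confirm with ./bin/mahout seq2sparse --help):
#   -wt tfidf   term weighting (tf or tfidf)
#   -md 1       minimum document frequency for a term to be kept
#   -x 99       maximum document-frequency percentage (drops stopword-like terms)
#   -ng 2       maximum n-gram size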


# Take a look at the output directory; it should contain 7 items:
#reuters-vectors/df-count
#reuters-vectors/dictionary.file-0
#reuters-vectors/frequency.file-0
#reuters-vectors/tf-vectors
#reuters-vectors/tfidf-vectors
#reuters-vectors/tokenized-documents
#reuters-vectors/wordcount
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors
# check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors/tf-vectors
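
# You can also dump a few vectors themselves; each record is a document
# key mapped to a sparse list of termId:weight entries (the part file
# name may differ on your cluster):
mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s reuters-vectors/tfidf-vectors/part-r-00000 | less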
# Run kmeans

mahout-0.5-cdh3u5:$ ./bin/mahout kmeans -i
reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd
0.1 -k 20 -x 10 -ow
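
# For reference, the flags used above (as of Mahout 0.5):
#   -i   input vector directory (the tf-idf vectors)
#   -o   output directory for the cluster iterations
#   -c   path for the initial centroids; because -k is given, k random
#        vectors are sampled and written here first
#   -cd  convergence delta: stop when centroids move less than this
#   -k   number of clusters
#   -x   maximum number of iterations
#   -ow  overwrite the output directory if it already exists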


# Check the cluster output

#
http://stackoverflow.com/questions/5805225/interpreting-output-from-mahout-clusterdumper



mahout clusterdump -s mahout-clusters/clusters-* -d
reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20 -o
./cluster-output.txt
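
# The dump lists one block per cluster. A rough sketch of what to
# expect in cluster-output.txt (ids, sizes, and terms vary per run;
# VL- means the cluster converged, CL- means it did not):
#
# VL-12345{n=420 c=[oil:0.281, opec:0.214, ...] r=[...]}
#   Top Terms:
#     oil     => 7.432
#     opec    => 5.118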


# Some other tips
You can set any other Hadoop property by doing:

mahout <options> -D<hadoop_property>=<value>

Replace <hadoop_property> with the property name you want to define and
<value> with the value you want.
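
# For example, to run the vectorization step on a specific queue with
# more reducers (property names are the pre-YARN mapred.* ones that
# this Mahout/CDH3 vintage expects):

./bin/mahout seq2sparse -i /your-hdfs-path-to/reuters-seqfiles \
  -o /your-hdfs-path-to/reuters-vectors \
  -Dmapred.job.queue.name=your-queue-name \
  -Dmapred.reduce.tasks=10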



In cases where you hit an "OutOfMemoryError" or "GC overhead limit
exceeded", it may help to add the following parameters to your Mahout
job:

-Dmapred.child.ulimit=4718592
(required in order to change the memory heap allocations for either the
map or the reduce phase)

-Dmapred.map.child.java.opts=-Xmx3g
(recommended extension of memory; up to 4g is acceptable)

-Dmapred.reduce.child.java.opts=-Xmx3g
(recommended extension of memory; up to 4g is acceptable)

-Dmapred.child.java.opts=-Xmx3g
(recommended extension of memory; up to 4g is acceptable; using this
will overwrite the map and reduce memory allocations with the one
specified here)
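
# Putting it together: a memory-bumped k-means run might look like the
# sketch below (same hypothetical HDFS paths as above; adjust to taste).

./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -o mahout-clusters -c mahout-initial-centers \
  -cd 0.1 -k 20 -x 10 -ow \
  -Dmapred.child.java.opts=-Xmx3g \
  -Dmapred.child.ulimit=4718592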
