Sunday, August 31, 2014

K-Means Clustering

There are many different approaches to clustering, both in the broader machine learning community and within Mahout. For instance, Mahout alone, as of this writing, has clustering implementations called:

- Canopy
- Mean-Shift
- Dirichlet
- Spectral
- K-Means and Fuzzy K-Means

Of these choices, K-Means is easily the most widely known. K-Means is a simple and straightforward approach to clustering that often yields good results relatively quickly. It operates by iteratively assigning each document to one of k clusters based on the distance, as determined by a user-supplied distance measure, between the document and the centroid of that cluster. At the end of each iteration, each centroid is recalculated from the documents currently assigned to it. The process stops once there’s little-to-no change in the centroids or some maximum number of iterations has passed, since otherwise K-Means isn’t guaranteed to converge. The algorithm is kicked off by either seeding it with some initial centroids or by randomly choosing centroids from the set of vectors in the input dataset.

K-Means does have some downsides. First and foremost, you must pick k, and naturally you’ll get different results for different values of k. Furthermore, the initial choice of centroids can greatly affect the outcome, so you should be sure to try different starting centroids across several runs. In the end, as with most techniques, it’s wise to perform several runs with various parameters to determine what works best for your data.
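To make the loop described above concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation; the function names (`kmeans`, `euclidean`, `mean`), the parameters (`max_iter`, `delta`, `initial`, `seed`), and the sample points are all assumptions chosen for illustration. It shows the assign-then-recalculate iteration, both ways of choosing starting centroids, and the two stopping conditions.

```python
import math
import random


def euclidean(a, b):
    """One possible distance measure; any function of two vectors can be passed in."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def kmeans(vectors, k, distance=euclidean, max_iter=10, delta=1e-3,
           initial=None, seed=None):
    """Return (centroids, assignments) after at most max_iter iterations."""
    if initial is not None:
        # Seed with caller-supplied centroids.
        centroids = [tuple(c) for c in initial]
    else:
        # Otherwise pick k input vectors at random as starting centroids.
        centroids = [tuple(v) for v in random.Random(seed).sample(vectors, k)]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each vector joins the cluster with the nearest centroid.
        assignments = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recalculate each centroid from its current members.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            # An empty cluster keeps its old centroid (one simple policy).
            new_centroids.append(mean(members) if members else centroids[c])
        # Stop once no centroid moved more than delta.
        moved = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= delta:
            break
    return centroids, assignments


# Hypothetical data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

# Try several random starts, since the initial centroids can change the outcome.
for seed in (0, 1, 2):
    centroids, labels = kmeans(points, k=2, seed=seed)
    print(seed, sorted(centroids), labels)
```

Comparing the printed centroids across seeds, and repeating the loop for different values of k, is a small-scale version of the several runs recommended above. Note that if the iteration cap is reached before the centroids settle, the returned assignments reflect the centroids from the start of the last iteration.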

Running the K-Means clustering algorithm in Mahout is as simple as executing the org.apache.mahout.clustering.kmeans.KMeansDriver class with the appropriate input parameters. Thanks to the power of Hadoop, you can execute this in either standalone mode or distributed mode (on a Hadoop cluster). For the purposes of this article, we’ll use standalone mode, but there isn’t much difference for distributed mode.

Instead of first looking at the options that KMeansDriver takes, let’s go straight to an example using the Vector dump we created earlier. The next listing shows an example command line for running the KMeansDriver.
