Hierarchical Clustering

Clustering, in one sentence, is the extraction of natural groupings of similar data objects.

There are a couple of general ideas that occur quite frequently with respect to clustering:

  • The clusters should be naturally occurring in data.
  • The clustering should discover hidden patterns in the data.
  • Data points within the cluster should be similar.
  • Data points in two different clusters should not be similar.

Common algorithms used for clustering include K-Means, DBSCAN, and Gaussian Mixture Models.
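To make the idea concrete, here is a minimal sketch of flat (non-hierarchical) clustering with K-Means via scikit-learn; the toy data and the choice of two clusters are illustrative assumptions, not from the original.

```python
# Minimal K-Means sketch (assumes scikit-learn is installed).
# The data below forms two well-separated groups by construction.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one natural group
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])  # another group

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # points in the same group share a label
```

Note that K-Means requires the number of clusters up front; the hierarchical methods below avoid committing to a single flat partition.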

Hierarchical Clustering

As mentioned before, hierarchical clustering builds on these clustering techniques to find a hierarchy of clusters, where the hierarchy resembles a tree structure, called a dendrogram.

Hierarchical clustering is the hierarchical decomposition of the data based on group similarities.
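A sketch of what the dendrogram encodes, using SciPy (an assumption; the original does not name a library): `linkage` returns a matrix in which each row records one merge in the tree.

```python
# Build the linkage matrix underlying a dendrogram (assumes SciPy).
# The 1-D toy points are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])

# Each row of Z is one merge: (cluster i, cluster j, distance, new size).
# For n points there are n - 1 merges, so Z has n - 1 rows.
Z = linkage(X, method="average")
print(Z)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself.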

Finding hierarchical clusters

There are two top-level methods for finding these hierarchical clusters:

  • Agglomerative clustering uses a bottom-up approach, wherein each data point starts in its own cluster. These clusters are then joined greedily, by repeatedly merging the two most similar clusters.
  • Divisive clustering uses a top-down approach, wherein all data points start in the same cluster. You can then use a parametric clustering algorithm like K-Means to split the cluster in two, and recursively split each resulting cluster until you reach the desired number of clusters.
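The bottom-up approach can be sketched with scikit-learn's `AgglomerativeClustering`; the data and the `average` linkage choice are illustrative assumptions.

```python
# Bottom-up (agglomerative) clustering sketch (assumes scikit-learn).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # tight group near origin
              [9.0, 9.0], [9.1, 8.9], [8.9, 9.1]])  # tight group far away

# Each point starts as its own cluster; the two most similar clusters
# are merged greedily until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = agg.fit_predict(X)
print(labels)
```

The divisive direction has no dedicated scikit-learn class; as described above, it can be approximated by calling K-Means with k=2 recursively on each resulting cluster.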

Both of these approaches rely on constructing a pairwise distance (or similarity) matrix between all of the data points, usually computed with cosine or Jaccard distance.
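Such a pairwise matrix can be computed with SciPy's `pdist` (an illustrative choice; any pairwise-distance routine works). Cosine distance is shown here; `pdist` also accepts `metric="jaccard"` for binary data.

```python
# Pairwise cosine-distance matrix sketch (assumes SciPy).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# pdist returns the condensed upper triangle; squareform expands it
# into the full symmetric n x n matrix with zeros on the diagonal.
D = squareform(pdist(X, metric="cosine"))
print(np.round(D, 3))
```

Orthogonal vectors (rows 0 and 2) get the maximum cosine distance of 1, while each point is at distance 0 from itself.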
