拓端tecdat | Comparison of Different Types of Clustering Methods in R

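Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine, and geospatial analysis. There are several types of clustering methods, including:

Partitioning methods
Hierarchical clustering
Fuzzy clustering
Density-based clustering
Model-based clustering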

Data preparation

Demo data set: the built-in R data set named USArrests
Remove missing data
Scale the variables to make them comparable

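# Load the required packages and prepare the data
library(magrittr)      # provides the %>% pipe
library(cluster)       # pam()
library(factoextra)    # get_dist(), eclust(), and the fviz_*() helpers

my_data <- USArrests %>%
  na.omit() %>%          # Remove missing values (NA)
  scale()                # Scale variables

# View the first 3 rows
head(my_data, n = 3)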

##         Murder Assault UrbanPop     Rape
## Alabama 1.2426   0.783   -0.521 -0.00342
## Alaska  0.5079   1.107   -1.212  2.48420
## Arizona 0.0716   1.479    0.999  1.04288
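
To make the scaling step concrete, here is a minimal base-R sketch that checks what scale() does: each column ends up with mean 0 and standard deviation 1. The variable names x and manual are illustrative only and do not appear in the original code.

```r
# Standardize the built-in USArrests data after dropping missing values
x <- scale(na.omit(USArrests))

# After scaling, each column has mean 0 and standard deviation 1
col_means <- colMeans(x)
col_sds <- apply(x, 2, sd)
round(col_means, 10)
col_sds

# The same result by hand: subtract the column mean, then divide by the column sd
raw <- as.matrix(USArrests)
manual <- sweep(sweep(raw, 2, colMeans(raw)), 2, apply(raw, 2, sd), "/")
all.equal(manual, x, check.attributes = FALSE)
```

Because the four variables are recorded in different units, skipping this step would let the variable with the largest variance dominate any distance computed later.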

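Distance measures

get_dist(): computes a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it also supports correlation-based distance measures, including the "pearson", "kendall", and "spearman" methods.
fviz_dist(): visualizes a distance matrix.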

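# Arguments after "get_dist(U" were truncated in the source; USArrests, stand, and method are assumed
res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist,
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))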
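
For readers who want to see what a correlation-based distance looks like without factoextra, the following base-R sketch builds one by hand. It assumes the common definition "1 minus the correlation between rows"; get_dist() may differ in detail, and the names d_euclid and d_pearson are illustrative.

```r
# Scale the built-in USArrests data so that the variables are comparable
x <- scale(USArrests)

# Euclidean distance between rows (states), as computed by dist()
d_euclid <- dist(x, method = "euclidean")

# Pearson correlation-based distance: 1 - correlation between rows.
# cor() correlates columns, so the matrix is transposed first.
d_pearson <- as.dist(1 - cor(t(x), method = "pearson"))

# Inspect the first few pairwise distances of each kind
round(as.matrix(d_euclid)[1:3, 1:3], 2)
round(as.matrix(d_pearson)[1:3, 1:3], 2)
```

A correlation-based distance treats two states as close when their profiles across the variables move together, even if the magnitudes differ, whereas the Euclidean distance compares the magnitudes themselves.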

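Partitioning clustering

Partitioning algorithms are clustering techniques that subdivide a data set into a set of k groups, where k is the number of groups specified in advance by the analyst.

An alternative to k-means clustering is K-medoids clustering, or PAM (Partitioning Around Medoids; Kaufman and Rousseeuw, 1990), which is less sensitive to outliers than k-means.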
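
The difference in outlier sensitivity can be illustrated with a small base-R sketch. It uses made-up points (not data from this article) and compares the two kinds of cluster centre for a single group: the mean used by k-means and the medoid used by PAM.

```r
# Six points on a line plus one extreme outlier (hypothetical data)
pts <- cbind(x = c(1, 2, 3, 4, 5, 6, 100),
             y = c(1, 2, 3, 4, 5, 6, 100))

# k-means-style centre: the coordinate-wise mean of the group
centroid <- colMeans(pts)

# PAM-style centre: the observation with the smallest total distance to all others
d <- as.matrix(dist(pts))
medoid <- pts[which.min(rowSums(d)), ]

centroid
medoid
```

The outlier drags the mean away from the bulk of the points, while the medoid must be one of the observations and therefore stays inside the main group.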

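The following R code shows how to determine the optimal number of clusters and how to compute k-means and PAM clustering in R.

Determining the optimal number of clusters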

fviz_nbclust(my_data, kmeans, method = "gap_stat")

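Computing and visualizing k-means clustering

set.seed(123)
# k = 3 and nstart = 25 are assumed; the kmeans() call is missing from the source
km.res <- kmeans(my_data, 3, nstart = 25)

# Visualize
fviz_cluster(km.res, data = my_data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())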

# Compute PAM (pam() comes from the cluster package)
library(cluster)
pam.res <- pam(my_data, 3)

# Visualize
fviz_cluster(pam.res)

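Hierarchical clustering

Hierarchical clustering is an alternative to partitioning clustering for identifying groups in a data set. It does not require the number of clusters to be specified in advance.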

# Compute hierarchical clustering
res.hc <- USArrests %>%
  scale() %>%                    # Scale the data
  dist(method = "euclidean") %>% # Compute the dissimilarity matrix that hclust() requires
  hclust(method = "ward.D2")     # Compute hierarchical clustering

# Visualize using factoextra
# Cut into 4 groups and color by group
fviz_dend(res.hc, k = 4,                 # Cut into four groups
          color_labels_by_k = TRUE,      # Color labels by group
          rect = TRUE                    # Add rectangles around groups
          )
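
Because the tree is built once and cut afterwards, the number of groups can be chosen after the fact. The base-R sketch below cuts the same tree at two different levels; the names hc, groups_2, and groups_4 are illustrative.

```r
# Build the tree once, without choosing a number of clusters
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")

# Cut the same tree at two different levels
groups_2 <- cutree(hc, k = 2)
groups_4 <- cutree(hc, k = 4)

# Group sizes at each level
table(groups_2)
table(groups_4)

# Cross-tabulation: each of the 4 groups sits inside exactly one of the 2 groups
nesting <- table(groups_4, groups_2)
nesting
```

The cross-tabulation shows the nesting property of hierarchical clustering: finer partitions are obtained by splitting coarser ones, never by reshuffling observations.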

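Assessing clustering tendency

To assess clustering tendency, the Hopkins statistic and a visual approach can be used.

Hopkins statistic: if the value of the Hopkins statistic is close to 1 (far above 0.5), we can conclude that the data set is significantly clusterable.
Visual approach: the visual approach detects clustering tendency by counting the number of square-shaped dark (or colored) blocks along the diagonal of the ordered dissimilarity image.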
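
As a rough illustration of the idea behind the Hopkins statistic, the sketch below compares nearest-neighbour distances from uniformly generated points to the data with nearest-neighbour distances among the real observations. It is a simplified version under stated assumptions: the function name hopkins_sketch is hypothetical, plain (unpowered) Euclidean distances are used, and the result need not match the value returned by any particular package.

```r
hopkins_sketch <- function(x, n) {
  x <- as.matrix(x)
  p <- ncol(x)

  # Sample n real observations
  idx <- sample(nrow(x), n)

  # Generate n artificial points uniformly within the range of each variable
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  u_pts <- sapply(seq_len(p), function(j) runif(n, mins[j], maxs[j]))

  # Distance from one point to its nearest neighbour among the rows of data
  nn_dist <- function(pt, data) min(sqrt(rowSums(sweep(data, 2, pt)^2)))

  # u: artificial points to real data; w: sampled real points to the other real points
  u <- apply(u_pts, 1, nn_dist, data = x)
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))

  # Values near 0.5 suggest uniform data; values near 1 suggest clustered data
  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
h <- hopkins_sketch(scale(iris[, -5]), n = 50)
h
```

When the data are clustered, real observations sit much closer to each other than random points sit to the data, so the ratio moves towards 1.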

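R code:

# gradient.color is not defined in the source; the colors below are assumed
gradient.color <- list(low = "steelblue", high = "white")
iris[, -5] %>%    # Remove column 5 (Species)
  scale() %>%     # Scale variables
  get_clust_tendency(n = 50, gradient = gradient.color)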

## $hopkins_stat
## [1] 0.8
## 
## $plot

Determining the optimal number of clusters

set.seed(123)

# Compute
library(NbClust)
res.nbclust <- USArrests %>%
  scale() %>%
  NbClust(distance = "euclidean",
          min.nc = 2, max.nc = 10,
          method = "complete", index = "all")

# Visualize
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 9 proposed  2 as the best number of clusters
## * 4 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 2 proposed  5 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .
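
The majority rule in the output above is simply a vote count. The sketch below reproduces that final step from the tallies printed in the output; the vector name votes is illustrative.

```r
# Number of indices proposing each candidate number of clusters,
# copied from the summary printed above
votes <- c("0" = 2, "1" = 1, "2" = 9, "3" = 4, "4" = 6,
           "5" = 2, "8" = 1, "10" = 1)

# Majority rule: keep the candidate with the most votes
best_k <- as.integer(names(votes)[which.max(votes)])
best_k

# Share of the indices that agree with the winner
round(max(votes) / sum(votes), 2)
```

The share is worth checking: here the winner is supported by a plurality rather than a majority of the indices, and 4 clusters is a close runner-up.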

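Cluster validation statistics

In the R code below, we compute and evaluate the results of a hierarchical clustering.

Computing and visualizing the hierarchical clustering: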

# Enhanced hierarchical clustering, cut in 3 groups
res.hc <- iris[, -5] %>%
  scale() %>%
  eclust("hclust", k = 3, graph = FALSE)

# Visualize with factoextra
fviz_dend(res.hc, palette = "jco",
          rect = TRUE, show_labels = FALSE)


Inspecting the silhouette plot:

fviz_silhouette(res.hc)

##   cluster size ave.sil.width
## 1       1   49          0.63
## 2       2   30          0.44
## 3       3   71          0.32

Which samples have a negative silhouette width, and which cluster are they closer to?

# Silhouette width of observations
sil <- res.hc$silinfo$widths[, 1:3]

# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]

##     cluster neighbor sil_width
## 84        3        2   -0.0127
## 122       3        2   -0.0179
## 62        3        2   -0.0476
## 135       3        2   -0.0530
## 73        3        2   -0.1009
## 74        3        2   -0.1476
## 114       3        2   -0.1611
## 72        3        2   -0.2304
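
To show where silhouette widths come from, here is a base-R sketch that computes them directly from the definition s(i) = (b - a) / max(a, b), where a is the mean distance to the other members of the observation's own cluster and b is the smallest mean distance to any other cluster. It assumes Ward linkage on Euclidean distances, so the values need not match the eclust() output exactly.

```r
x <- scale(iris[, -5])
d <- as.matrix(dist(x))
cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

sil_width <- sapply(seq_len(nrow(x)), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the observation itself
  if (!any(own)) return(0)             # singleton cluster: width is 0 by convention
  a <- mean(d[i, own])                 # cohesion: mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
})

# Average silhouette width per cluster, and the observations with negative width
tapply(sil_width, cl, mean)
which(sil_width < 0)
```

A negative width means b < a: the observation is, on average, closer to a neighbouring cluster than to its own, which is why such observations are candidates for misassignment.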

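Advanced clustering methods

Hybrid clustering methods

Hierarchical k-means clustering: a hybrid approach for improving k-means results
HCPC: Hierarchical Clustering on Principal Components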
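
The idea behind hierarchical k-means can be sketched in base R: cut a hierarchical tree into k groups, use the group means as starting centres, and let k-means refine the partition. This is a sketch of the general idea under the assumption of Ward linkage and k = 4; factoextra provides hkmeans() for this purpose, and its details may differ.

```r
x <- scale(USArrests)
k <- 4

# Step 1: hierarchical clustering, cut into k groups
hc <- hclust(dist(x), method = "ward.D2")
grp <- cutree(hc, k = k)

# Step 2: group means become the initial centres (a k x p matrix)
init_centers <- apply(x, 2, function(col) tapply(col, grp, mean))

# Step 3: k-means started from those centres instead of random ones
km <- kmeans(x, centers = init_centers)

# Within-cluster sum of squares before and after the k-means refinement
wss_hc <- sum(sapply(split(as.data.frame(x), grp),
                     function(g) sum(scale(g, scale = FALSE)^2)))
c(hierarchical = wss_hc, refined = km$tot.withinss)

# How many observations changed cluster during refinement
table(hierarchical = grp, kmeans = km$cluster)
```

Starting from the hierarchical solution removes the dependence of k-means on random initial centres, so the result is reproducible without setting a seed.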

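Fuzzy clustering

Fuzzy clustering is also known as soft clustering. Standard clustering methods (k-means, PAM) assign each observation to exactly one cluster; this is called hard clustering. In fuzzy clustering, by contrast, each observation can belong to several clusters with varying degrees of membership.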
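
To make the notion of membership degrees concrete, the following base-R sketch implements a bare-bones fuzzy c-means loop. The function name fuzzy_cmeans, the fuzzifier m = 2, and the fixed iteration count are assumptions of this sketch, not something taken from the article; dedicated packages offer more complete implementations.

```r
fuzzy_cmeans <- function(x, k, m = 2, iters = 50) {
  x <- as.matrix(x)
  # Start from k randomly chosen observations as centres
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (it in seq_len(iters)) {
    # Squared Euclidean distance of every observation to every centre (n x k)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, 1e-12)               # avoid division by zero
    # Membership degrees: closer centres get higher weight; each row sums to 1
    w <- d2^(-1 / (m - 1))
    u <- w / rowSums(w)
    # Update each centre as a membership-weighted mean of the observations
    um <- u^m
    centers <- t(um) %*% x / colSums(um)
  }
  list(membership = u, centers = centers)
}

set.seed(123)
res <- fuzzy_cmeans(scale(USArrests), k = 2)

# Each row gives one state's degree of membership in the two clusters
round(head(res$membership, 5), 2)

# A hard assignment can still be derived by taking the largest membership
hard <- max.col(res$membership)
table(hard)
```

Observations with memberships close to 0.5 in both clusters lie between the groups, which is information a hard clustering would discard.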

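Model-based clustering

In model-based clustering, the data are considered to come from a distribution that is a mixture of two or more clusters. The method finds the model that best fits the data and estimates the number of clusters.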
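
The mixture idea can be sketched with a small expectation-maximization (EM) loop in base R. The example uses simulated one-dimensional data and fixes the number of components at two, so it illustrates only the model-fitting step, not the selection of the number of clusters; all names and starting values are assumptions of this sketch.

```r
set.seed(42)
# Simulated data: a mixture of two normal distributions (hypothetical example)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 1))
n <- length(x)

# Starting values: extreme means, common spread, equal mixing proportions
mu <- range(x)
sigma <- rep(sd(x), 2)
p <- c(0.5, 0.5)

for (it in 1:100) {
  # E-step: posterior probability that each point belongs to each component
  dens <- cbind(p[1] * dnorm(x, mu[1], sigma[1]),
                p[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # M-step: re-estimate proportions, means, and standard deviations
  nk <- colSums(resp)
  p <- nk / n
  mu <- colSums(resp * x) / nk
  sigma <- sqrt(colSums(resp * (x - rep(mu, each = n))^2) / nk)
}

# Estimated parameters of the two components
round(rbind(proportion = p, mean = mu, sd = sigma), 2)
```

The matrix resp plays the same role as the membership degrees in fuzzy clustering, but here the degrees are probabilities under an explicit statistical model.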

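DBSCAN: density-based clustering

DBSCAN is a clustering method introduced by Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers. The basic idea behind density-based clustering derives from the intuitive way humans identify clusters.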
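
The density idea can be sketched in a few lines of base R. The function below is a simplified illustration, not a reference implementation: the name dbscan_sketch and the parameter names eps (neighbourhood radius) and min_pts (minimum number of points, including the point itself, for a dense neighbourhood) are assumptions of this sketch, and the toy data are made up.

```r
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # Neighbourhood of each point (includes the point itself)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  # Core points have at least min_pts points within eps
  core <- lengths(neighbors) >= min_pts
  cluster <- integer(n)                 # 0 means noise or not yet assigned
  current <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !core[i]) next
    current <- current + 1L             # start a new cluster at an unvisited core point
    queue <- i
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next
      cluster[j] <- current
      # Only core points spread the cluster to their unassigned neighbours
      if (core[j]) {
        nb <- neighbors[[j]]
        queue <- c(queue, nb[cluster[nb] == 0L])
      }
    }
  }
  cluster
}

# Toy data: two 3 x 3 grids of points far apart, plus one isolated point
blob <- as.matrix(expand.grid(x = 0:2, y = 0:2))
pts <- rbind(blob, blob + 10, c(5, 20))

cl <- dbscan_sketch(pts, eps = 1.1, min_pts = 3)
cl
```

Unlike k-means, the number of clusters is not supplied: it follows from the density parameters, and points that are not reachable from any dense region keep the label 0 and are treated as noise.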

