Visualizing the DBSCAN Clustering Algorithm in Python

DBSCAN stands for "Density-Based Spatial Clustering of Applications with Noise". We can apply it in Python using the implementation in sklearn.

First, import the required libraries.

import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import matplotlib
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

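We first define a function that builds the dataset we need. The dataset is two-dimensional and consists of three circles, which are plotted further below once the data has been assembled. In the function, r is the radius and n is the number of points; each coordinate is also perturbed by Gaussian noise drawn with np.random.normal(-30,30), that is, mean -30 and standard deviation 30.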

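np.random.seed(42)  # fix the seed so the dataset is reproducible

def PointsInCircum(r, n=100):
    # Return n points spaced evenly around a circle of radius r,
    # with Gaussian noise (mean -30, std 30) added to each coordinate.
    return [(math.cos(2*math.pi/n*x)*r + np.random.normal(-30, 30),
             math.sin(2*math.pi/n*x)*r + np.random.normal(-30, 30))
            for x in range(1, n+1)]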
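
For comparison, the same kind of noisy circle can be generated with vectorized NumPy. This is only a sketch: the function name `points_on_noisy_circle` and its `noise_mean`, `noise_std` and `rng` parameters are hypothetical and not part of the original code, and because it draws random numbers in a different order, it will not reproduce the exact points of `PointsInCircum`.

```python
import numpy as np

def points_on_noisy_circle(r, n=100, noise_mean=-30.0, noise_std=30.0, rng=None):
    """Return an (n, 2) array of points on a circle of radius r with Gaussian noise added."""
    rng = np.random.default_rng() if rng is None else rng
    angles = 2 * np.pi / n * np.arange(1, n + 1)
    circle = np.column_stack((np.cos(angles), np.sin(angles))) * r
    return circle + rng.normal(noise_mean, noise_std, size=(n, 2))

pts = points_on_noisy_circle(500, 1000, rng=np.random.default_rng(42))
# The noise mean of -30 shifts the circle's centre from (0, 0) to (-30, -30).
radii = np.linalg.norm(pts - (-30.0), axis=1)
print(pts.shape, radii.mean())
```

The mean distance from the shifted centre should stay close to the requested radius, which is a quick sanity check on the generated data.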

We store the three circles in separate DataFrames, then create a noise dataset to test DBSCAN against.

df1=pd.DataFrame(PointsInCircum(500,1000))
df2=pd.DataFrame(PointsInCircum(300,700))
df3=pd.DataFrame(PointsInCircum(100,300))
# Adding noise to the dataset
df4=pd.DataFrame([(np.random.randint(-600,600),np.random.randint(-600,600)) for i in range(300)])

Merge the four DataFrames into a single DataFrame, then visualize it.

df = pd.concat([df1,df2,df3,df4])

plt.figure(figsize=(10,10))
plt.scatter(df[0],df[1],s=15,color='grey')
plt.title('Dataset',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()

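Next, we use the DBSCAN class imported from sklearn. We fit it on the DataFrame, add a column to the original DataFrame that records the labels DBSCAN produces, and use those labels as the color map for the visualization.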

dbscan=DBSCAN()
dbscan.fit(df[[0,1]])

df['DBSCAN_labels']=dbscan.labels_ 

# Plotting resulting clusters
plt.figure(figsize=(10,10))
colors=['purple','red','blue','green']
plt.scatter(df[0],df[1],c=df['DBSCAN_labels'],cmap=matplotlib.colors.ListedColormap(colors),s=15)
plt.title('DBSCAN Clustering',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()

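In the resulting plot, every point is purple. This shows that the neighborhood radius epsilon is too small (sklearn's default is eps=0.5), so DBSCAN treats every point as noise and assigns it the label -1. We can use nearest-neighbor distances to choose a better epsilon.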
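
To see the effect in isolation, here is a minimal sketch on toy data. The 20 evenly spaced points and the value eps=25 are assumptions chosen for illustration only; they are not taken from the dataset above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (an assumption for illustration): 20 points on a line, 10 units apart.
X = np.column_stack((np.arange(20) * 10.0, np.zeros(20)))

# Defaults are eps=0.5 and min_samples=5: no point has a neighbour within 0.5.
default_labels = DBSCAN().fit(X).labels_
# A radius larger than the spacing lets neighbourhoods overlap and chain together.
wider_labels = DBSCAN(eps=25, min_samples=5).fit(X).labels_

print(np.unique(default_labels))
print(np.unique(wider_labels))
```

With the default radius every point ends up labeled -1 (noise), while the wider radius joins all points into a single cluster labeled 0.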

neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(df[[0,1]])
distances, indices = nbrs.kneighbors(df[[0,1]])

# Plotting K-distance Graph
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.figure(figsize=(20,10))
plt.plot(distances)
plt.title('K-distance Graph',fontsize=20)
plt.xlabel('Data Points sorted by distance',fontsize=14)
plt.ylabel('Epsilon',fontsize=14)
plt.show()
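
The code above plots the distance from each point to its single nearest neighbor. A common variant, sketched here as an assumption and not as the method this article uses, plots the distance to the k-th nearest neighbor with k tied to the intended min_samples. The helper name `k_distances` is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbour, excluding itself."""
    # Ask for k + 1 neighbours: when querying the training data, each point
    # is returned as its own nearest neighbour at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, k])

# Toy data (hypothetical): 10 points on a line, 1 unit apart.
X = np.column_stack((np.arange(10.0), np.zeros(10)))
kd = k_distances(X, 2)
print(kd)
```

Interior points have two neighbors at distance 1, while the two end points must reach 2 units for their second neighbor, so the sorted curve is flat and then steps up at the end.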

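In this plot, the point where the curve bends most sharply (its elbow) corresponds to an epsilon of about 30. We rerun DBSCAN with eps=30 and set minPoints (min_samples in sklearn) to 6.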
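
The elbow is read off the plot by eye here. As a rough cross-check, one simple heuristic (an assumption of this sketch, not something the article itself uses) picks the point of a sorted, increasing curve that lies farthest below the straight line joining its two endpoints. The function name `knee_index` and the synthetic curve are hypothetical.

```python
import numpy as np

def knee_index(sorted_values):
    """Index of the point farthest below the line joining the curve's endpoints."""
    y = np.asarray(sorted_values, dtype=float)
    x = np.linspace(0.0, 1.0, len(y))
    # Rescale to [0, 1] so the endpoint line becomes y = x; assumes y[-1] > y[0].
    y_norm = (y - y[0]) / (y[-1] - y[0])
    return int(np.argmax(x - y_norm))

# Synthetic k-distance-like curve (hypothetical data): nearly flat, then rising sharply.
curve = np.concatenate((np.linspace(1.0, 2.0, 90), np.linspace(2.0, 50.0, 11)[1:]))
idx = knee_index(curve)
print(idx, curve[idx])
```

Applied to the sorted `distances` array from the K-distance plot, `distances[knee_index(distances)]` would give a candidate epsilon. Treat it as a starting point and compare it with the plot, since the heuristic is sensitive to outliers at the far end of the curve.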

dbscan_opt=DBSCAN(eps=30,min_samples=6)
dbscan_opt.fit(df[[0,1]])

df['DBSCAN_opt_labels']=dbscan_opt.labels_

# Plotting the resulting clusters
plt.figure(figsize=(10,10))
plt.scatter(df[0],df[1],c=df['DBSCAN_opt_labels'],cmap=matplotlib.colors.ListedColormap(colors),s=15)
plt.title('DBSCAN Clustering',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()
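
Besides the plot, the labels can be summarized numerically. The sketch below counts clusters and noise points from a label array, relying only on the sklearn convention that noise is labeled -1; the helper name `summarize_labels` and the example labels are hypothetical.

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, number of noise points, {cluster label: size})."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    # Label -1 marks noise, so it is not counted as a cluster.
    sizes = {int(v): int(c) for v, c in zip(values, counts) if v != -1}
    n_noise = int(np.sum(labels == -1))
    return len(sizes), n_noise, sizes

example = [0, 0, 1, -1, 1, 1, -1, 2]
n_clusters, n_noise, sizes = summarize_labels(example)
print(n_clusters, n_noise, sizes)
```

Calling `summarize_labels(df['DBSCAN_opt_labels'])` would report how many clusters were found. The four-color list used for plotting gives one color per label only when there are exactly three clusters plus noise.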
