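Quick notes:
- What `%matplotlib inline` is in Matplotlib and how to use it: https://blog.csdn.net/liangzuojiayi/article/details/78183783
- load_iris loads the iris dataset bundled with sklearn (the task is to tell which class a flower belongs to from the length and width of its sepals and petals). Data format:
data.feature_names (same as data['feature_names']):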
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
data.target_names:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
data['data']:
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
......
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
data['target']:
array([0, 0, 0, 0,.......2, 2, 2])
- sklearn.datasets can load many other datasets as well
- t-SNE: https://blog.csdn.net/hustqb/article/details/78144384 explains in detail how t-SNE works, its strengths and weaknesses, and how to use it
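To make the load_iris note above concrete, here is a minimal sketch that loads the dataset and looks at the same fields. It assumes scikit-learn is installed; the variable name `iris` is only illustrative.

```python
from sklearn.datasets import load_iris

# load_iris returns a dict-like Bunch; fields are reachable by attribute or by key
iris = load_iris()

print(iris.feature_names)        # the four measurement names, in cm
print(list(iris.target_names))   # the three species names
print(iris.data.shape)           # (n_samples, n_features)
print(iris.target[:5])           # integer class ids that index into target_names
```

Each value in `iris.target` is an index into `iris.target_names`, so `iris.target_names[iris.target[0]]` gives the species of the first sample.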
Code:
- Plot the distribution of the handwritten digit image data
from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
import pandas as pd
digits = datasets.load_digits(n_class=10)
df = pd.DataFrame(digits.data)
label = digits.target
df['label'] = label
print(type(digits.data))
df
# original data
<class 'numpy.ndarray'>
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 |
1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 |
2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 |
3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 | 3 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1792 | 0.0 | 0.0 | 4.0 | 10.0 | 13.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 14.0 | 15.0 | 9.0 | 0.0 | 0.0 | 9 |
1793 | 0.0 | 0.0 | 6.0 | 16.0 | 13.0 | 11.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 16.0 | 14.0 | 6.0 | 0.0 | 0.0 | 0 |
1794 | 0.0 | 0.0 | 1.0 | 11.0 | 15.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | 13.0 | 6.0 | 0.0 | 0.0 | 8 |
1795 | 0.0 | 0.0 | 2.0 | 10.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 12.0 | 16.0 | 12.0 | 0.0 | 0.0 | 9 |
1796 | 0.0 | 0.0 | 10.0 | 14.0 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 8.0 | 12.0 | 14.0 | 12.0 | 1.0 | 0.0 | 8 |
1797 rows × 65 columns
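The 65 columns are the 64 pixel intensities of each 8×8 image plus the label column added above. A small sketch to check this, assuming scikit-learn's `digits.images` attribute (the unflattened images), which the original code does not use:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits(n_class=10)

print(digits.data.shape)    # flattened images: one 64-value row per digit
print(digits.images.shape)  # the same images kept as 8x8 arrays

# Reshaping a flattened row should give back the corresponding 8x8 image
first_image = digits.data[0].reshape(8, 8)
print(np.array_equal(first_image, digits.images[0]))
```

So t-SNE below is working on 64-dimensional vectors, one per image.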
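# Create the t-SNE model (2 output dimensions; other parameters are left at their defaults here)
tsne = TSNE(n_components=2)
result = tsne.fit_transform(digits.data)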
result
array([[ -4.2510934, 57.605927 ],
[ 27.768238 , -18.912882 ],
[ 19.440983 , -7.737709 ],
...,
[ 10.630893 , -12.436025 ],
[-18.820362 , 28.899649 ],
[ 6.5873857, -8.608063 ]], dtype=float32)
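t-SNE is stochastic, so the coordinates above will generally differ from run to run. A hedged sketch of one way to make them repeatable: fix `random_state` and, optionally, initialize from PCA. The parameter values and the 300-sample subset are assumptions chosen to keep the example quick, not the settings that produced the output above.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits(n_class=10)
subset = digits.data[:300]  # small subset (assumption) so the example runs fast

# n_components=2 gives 2-D coordinates; random_state fixes the random seed
tsne = TSNE(n_components=2, init='pca', random_state=0)
embedding = tsne.fit_transform(subset)

print(embedding.shape)  # one 2-D point per input sample
```

With the seed fixed, repeated runs on the same machine and library version should give the same embedding, which makes plots easier to compare.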
# draw the 2-D plot
x_min, x_max = np.min(result), np.max(result)
# min-max normalization: rescales all values into the range [0, 1]
result = (result - x_min)/(x_max-x_min)
fig = plt.figure()
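# subplot(111) selects the first cell of a 1x1 grid: the first two digits are the
# number of rows and columns, the third is the cell index; larger grids give smaller subplots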
ax = plt.subplot(111)
for i in range(result.shape[0]):
    plt.text(result[i,0], result[i,1], str(label[i]), color=plt.cm.Set1(label[i] / 10.), fontdict={'weight': 'bold','size': 9})
plt.xticks([])
plt.yticks([])
plt.title('hello')
plt.show()
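The normalization step in the plot code uses a single global minimum and maximum, so both axes are scaled by the same factor. A small NumPy sketch of that step next to a per-axis variant; the per-axis version is an alternative, not what the code above does, and the sample points are made up for illustration.

```python
import numpy as np

# Made-up 2-D points standing in for a t-SNE result
points = np.array([[-4.0, 50.0],
                   [20.0, -10.0],
                   [6.0, 30.0]])

# Global min-max, as in the code above: one min and one max for the whole array
g_min, g_max = np.min(points), np.max(points)
global_scaled = (points - g_min) / (g_max - g_min)

# Per-axis min-max: each column is stretched to [0, 1] independently
col_min, col_max = points.min(axis=0), points.max(axis=0)
axis_scaled = (points - col_min) / (col_max - col_min)

print(global_scaled)
print(axis_scaled)
```

Global scaling keeps the relative distances between points intact, while per-axis scaling fills the plot area more evenly but distorts the aspect ratio.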
Result: