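Quick tips:
- What %matplotlib inline is in Matplotlib and how to use it: https://blog.csdn.net/liangzuojiayi/article/details/78183783
- load_iris loads the iris dataset bundled with sklearn (the task is to tell which of three species a flower belongs to from the length and width of its sepals and petals). Data format:
data.feature_names (equivalently data['feature_names']):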
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
data.target_names:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
data['data']:
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
......
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
data['target']:
array([0, 0, 0, 0,.......2, 2, 2])
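- sklearn.datasets can load many other datasets as well
- t-SNE: https://blog.csdn.net/hustqb/article/details/78144384 explains the principle, the pros and cons, and the usage of t-SNE in detail
Code:
–Plot the data distribution of the handwritten digit images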
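The load_iris fields listed above can be inspected with a short sketch like the one below. It assumes scikit-learn is installed, and the variable names are only illustrative. It also calls two other bundled loaders (load_digits, load_wine) to show that they return the same kind of object.

```python
from sklearn.datasets import load_iris, load_digits, load_wine

data = load_iris()  # Bunch object: dict-like, also supports attribute access

# Attribute access and key access return the same object
assert data.feature_names == data['feature_names']

print(data.feature_names)   # names of the 4 measurements
print(data.target_names)    # names of the 3 species
print(data.data.shape)      # (150, 4): 150 flowers, 4 features each
print(data.target.shape)    # (150,): one integer class label per flower

# Other bundled datasets follow the same pattern
for loader in (load_digits, load_wine):
    bunch = loader()
    print(loader.__name__, bunch.data.shape, bunch.target.shape)
```

The integer values in data.target index into data.target_names, so data.target_names[data.target] gives the species name for every row.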
from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
import pandas as pd
digits = datasets.load_digits(n_class=10)
df = pd.DataFrame(digits.data)
label = digits.target
df['label'] = label
print(type(digits.data))
df
# original data
<class 'numpy.ndarray'>
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0 |
1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 | 1 |
2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 |
3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 | 3 |
4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1792 | 0.0 | 0.0 | 4.0 | 10.0 | 13.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 14.0 | 15.0 | 9.0 | 0.0 | 0.0 | 9 |
1793 | 0.0 | 0.0 | 6.0 | 16.0 | 13.0 | 11.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 6.0 | 16.0 | 14.0 | 6.0 | 0.0 | 0.0 | 0 |
1794 | 0.0 | 0.0 | 1.0 | 11.0 | 15.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 9.0 | 13.0 | 6.0 | 0.0 | 0.0 | 8 |
1795 | 0.0 | 0.0 | 2.0 | 10.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 5.0 | 12.0 | 16.0 | 12.0 | 0.0 | 0.0 | 9 |
1796 | 0.0 | 0.0 | 10.0 | 14.0 | 8.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 8.0 | 12.0 | 14.0 | 12.0 | 1.0 | 0.0 | 8 |
1797 rows × 65 columns
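Each row of this table is one 8×8 grayscale image flattened into 64 pixel columns (intensities from 0 to 16), followed by its label. A small sketch, assuming the same load_digits data, that reshapes one row back into its image and checks it against digits.images:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits(n_class=10)

# digits.data holds the flattened pixels, digits.images the original 8x8 grids
first_image = digits.data[0].reshape(8, 8)
print(first_image.shape)                               # (8, 8)
print(np.array_equal(first_image, digits.images[0]))   # True
print(digits.target[0])                                # label of this image
```

This is why the DataFrame has 65 columns: 64 pixels plus the label column that was appended.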
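# tsne was never defined in the original listing; a 2-D embedding with a fixed seed is assumed here
tsne = TSNE(n_components=2, init='pca', random_state=0)
result = tsne.fit_transform(digits.data)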
result
array([[ -4.2510934, 57.605927 ],
[ 27.768238 , -18.912882 ],
[ 19.440983 , -7.737709 ],
...,
[ 10.630893 , -12.436025 ],
[-18.820362 , 28.899649 ],
[ 6.5873857, -8.608063 ]], dtype=float32)
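The TSNE estimator behind tsne.fit_transform takes a few parameters worth knowing. Below is a minimal sketch, assuming a 2-D embedding; the perplexity, init and random_state values are illustrative choices, not the settings that produced the output shown above. Only the first 300 samples are embedded to keep the run short.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits(n_class=10)
subset = digits.data[:300]   # small subset so the example runs quickly

tsne = TSNE(
    n_components=2,    # embed into 2 dimensions for plotting
    perplexity=30.0,   # roughly the effective number of neighbours per point
    init='pca',        # start from a PCA projection instead of random positions
    random_state=0,    # fix the seed; t-SNE is stochastic otherwise
)
embedding = tsne.fit_transform(subset)
print(embedding.shape)       # (300, 2)
```

Note that TSNE only offers fit_transform, not a separate transform, so new points cannot be projected into an existing embedding.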
# draw the 2-D plot
x_min, x_max = np.min(result), np.max(result)
# Min-max scale with the global min and max so that every coordinate falls in [0, 1]
result = (result - x_min)/(x_max-x_min)
fig = plt.figure()
# subplot(111) means a 1-row, 1-column grid with this axes in cell 1; larger row/column counts give smaller cells
ax = plt.subplot(111)
for i in range(result.shape[0]):
    plt.text(result[i,0], result[i,1], str(label[i]), color=plt.cm.Set1(label[i] / 10.), fontdict={'weight': 'bold','size': 9})
plt.xticks([])
plt.yticks([])
plt.title('hello')
plt.show()
Result:
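A note on the normalization step in the plotting code: np.min(result) and np.max(result) are taken over the whole array, so both axes are scaled by the same factor and the shape of the embedding is preserved. A per-axis variant (using axis=0) would instead stretch each axis to fill [0, 1] on its own. The toy array below is made up for illustration and is not t-SNE output.

```python
import numpy as np

# Made-up 2-D points standing in for a t-SNE embedding
points = np.array([[-4.0, 50.0],
                   [20.0, -10.0],
                   [10.0, 30.0]])

# Global min-max scaling (what the plotting code does)
g_min, g_max = np.min(points), np.max(points)
scaled_global = (points - g_min) / (g_max - g_min)

# Per-axis min-max scaling: each column is scaled independently
col_min, col_max = points.min(axis=0), points.max(axis=0)
scaled_per_axis = (points - col_min) / (col_max - col_min)

print(scaled_global.min(), scaled_global.max())   # 0.0 1.0
print(scaled_global[:, 0].max())                  # 0.5: x does not span the full range
print(scaled_per_axis.min(axis=0), scaled_per_axis.max(axis=0))
```

With global scaling the text labels may not fill the whole axes, but relative distances between points stay proportional in both directions.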