
[Python] [Data Processing] Plotting the Distribution of Multi-Dimensional Data

Quick tips:

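  • What %matplotlib inline is in Matplotlib and how to use it: https://blog.csdn.net/liangzuojiayi/article/details/78183783
  • load_iris loads the iris dataset bundled with sklearn (the task is to tell which class a flower belongs to from the length and width of its sepals and petals). Data format: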
data.feature_names (same as data['feature_names']):
    ['sepal length (cm)',
     'sepal width (cm)',
     'petal length (cm)',
     'petal width (cm)']
data.target_names:
    array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
data['data']:
    array([[5.1, 3.5, 1.4, 0.2],
           [4.9, 3. , 1.4, 0.2],
           ......
           [6.5, 3. , 5.2, 2. ],
           [6.2, 3.4, 5.4, 2.3],
           [5.9, 3. , 5.1, 1.8]])
data['target']:
    array([0, 0, 0, 0,.......2, 2, 2])
           
  • sklearn.datasets can load many other kinds of data as well
  • t-SNE: https://blog.csdn.net/hustqb/article/details/78144384 explains in detail how t-SNE works, its strengths and weaknesses, and how to use it
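To make the load_iris fields listed above easier to inspect, here is a minimal sketch that puts them into a pandas DataFrame. The names iris_df and species are illustrative choices, not something from the original notes.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()  # Bunch object: iris.data and iris['data'] are the same array

# One column per feature, named after iris.feature_names
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to their class names
# ('species' is an illustrative column name)
iris_df['species'] = [str(iris.target_names[t]) for t in iris.target]

print(iris_df.shape)
print(iris_df['species'].value_counts())
```

Keeping the class name next to the measurements makes it easy to group or filter by species later.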

Code:

– Plot the data distribution of the handwritten digit images

from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
import pandas as pd
           
digits = datasets.load_digits(n_class=10)
df = pd.DataFrame(digits.data)
label = digits.target
df['label']  = label
print(type(digits.data))
df
# original data
           
<class 'numpy.ndarray'>
           
         0     1     2     3     4     5     6     7     8     9   ...    55    56    57    58    59    60    61    62    63  label
   0   0.0   0.0   5.0  13.0   9.0   1.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   6.0  13.0  10.0   0.0   0.0   0.0      0
   1   0.0   0.0   0.0  12.0  13.0   5.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   0.0  11.0  16.0  10.0   0.0   0.0      1
   2   0.0   0.0   0.0   4.0  15.0  12.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   0.0   3.0  11.0  16.0   9.0   0.0      2
   3   0.0   0.0   7.0  15.0  13.0   1.0   0.0   0.0   0.0   8.0   ...   0.0   0.0   0.0   7.0  13.0  13.0   9.0   0.0   0.0      3
   4   0.0   0.0   0.0   1.0  11.0   0.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   0.0   2.0  16.0   4.0   0.0   0.0      4
 ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...    ...
1792   0.0   0.0   4.0  10.0  13.0   6.0   0.0   0.0   0.0   1.0   ...   0.0   0.0   0.0   2.0  14.0  15.0   9.0   0.0   0.0      9
1793   0.0   0.0   6.0  16.0  13.0  11.0   1.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   6.0  16.0  14.0   6.0   0.0   0.0      0
1794   0.0   0.0   1.0  11.0  15.0   1.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   2.0   9.0  13.0   6.0   0.0   0.0      8
1795   0.0   0.0   2.0  10.0   7.0   0.0   0.0   0.0   0.0   0.0   ...   0.0   0.0   0.0   5.0  12.0  16.0  12.0   0.0   0.0      9
1796   0.0   0.0  10.0  14.0   8.0   1.0   0.0   0.0   0.0   2.0   ...   0.0   0.0   1.0   8.0  12.0  14.0  12.0   1.0   0.0      8

1797 rows × 65 columns
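The 64 feature columns are easier to read once you know that each row is one flattened 8×8 grayscale image. A small sketch of that relationship, where first_image is an illustrative name:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits(n_class=10)

# Each row of digits.data holds 64 pixel intensities: an 8x8 image flattened row by row
first_image = digits.data[0].reshape(8, 8)

# digits.images stores the same pixels in their 8x8 layout
print(np.array_equal(first_image, digits.images[0]))
print(digits.data.shape, digits.images.shape)
```

So the 65 columns of the DataFrame are 64 pixels plus the label column.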

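# Assumption: the post never shows how `tsne` is created. Two components match the
# 2-column output below; the author's other settings are unknown.
tsne = TSNE(n_components=2)
result = tsne.fit_transform(digits.data)
result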
           
array([[ -4.2510934,  57.605927 ],
       [ 27.768238 , -18.912882 ],
       [ 19.440983 ,  -7.737709 ],
       ...,
       [ 10.630893 , -12.436025 ],
       [-18.820362 ,  28.899649 ],
       [  6.5873857,  -8.608063 ]], dtype=float32)
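t-SNE is stochastic, so coordinates like the ones above generally change between runs unless a seed is fixed. The following is a minimal sketch of a repeatable run on a subset of the data. The subset size, init='pca', perplexity=30 and random_state=0 are illustrative assumptions, not the settings used in this post.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits(n_class=10)
X_small = digits.data[:300]  # subset only to keep the demo fast (assumption)

# n_components=2 yields a 2-D embedding; random_state fixes the seed.
# init and perplexity are illustrative values, not the post's settings.
tsne_demo = TSNE(n_components=2, init='pca', perplexity=30, random_state=0)
embedding = tsne_demo.fit_transform(X_small)

print(embedding.shape)
```

Perplexity must be smaller than the number of samples, which is worth remembering when experimenting on small subsets.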
           
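# draw the 2-D picture

x_min, x_max = np.min(result), np.max(result)

# Min-max normalization: rescale every coordinate into [0, 1]
# (np.min/np.max without an axis use one global min/max for both columns)
result = (result - x_min) / (x_max - x_min)
fig = plt.figure()
# subplot(111) = a 1x1 grid, first cell; the first two digits are the number of rows and columns,
# so larger values split the figure into more, smaller subplots
ax = plt.subplot(111)
for i in range(result.shape[0]):
    # draw each sample as its digit label, colored by class
    plt.text(result[i, 0], result[i, 1], str(label[i]),
             color=plt.cm.Set1(label[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
plt.xticks([])
plt.yticks([])
plt.title('hello')
plt.show()  # plt.show() does not take a figure argument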
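The normalization step in the plotting code uses one global minimum and maximum, so one of the two axes may not fill the whole [0, 1] range. A sketch of the difference between global and per-axis scaling follows. The helper min_max_scale and the demo array are hypothetical, and the code assumes no column is constant (otherwise it would divide by zero).

```python
import numpy as np

def min_max_scale(points, per_axis=True):
    """Rescale an (n, d) array into [0, 1] (hypothetical helper).

    per_axis=True scales each column by its own min/max;
    per_axis=False uses a single global min/max, as in the post's code.
    Assumes max > min, otherwise this divides by zero.
    """
    points = np.asarray(points, dtype=float)
    axis = 0 if per_axis else None
    lo = points.min(axis=axis)
    hi = points.max(axis=axis)
    return (points - lo) / (hi - lo)

# Made-up 2-D points, only for illustration
demo = np.array([[-4.0, 57.0], [27.0, -18.0], [19.0, -7.0]])
scaled_global = min_max_scale(demo, per_axis=False)
scaled_axis = min_max_scale(demo, per_axis=True)

print(scaled_global.min(axis=0), scaled_global.max(axis=0))
print(scaled_axis.min(axis=0), scaled_axis.max(axis=0))
```

Global scaling preserves the aspect ratio of the embedding, while per-axis scaling stretches each axis to fill the plot. Which one is preferable depends on whether relative distances along the two axes should stay comparable.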
           

Result:

(Figure: 2-D t-SNE plot of the handwritten digits data)