
[Python] [Data Processing] Plotting the distribution of multi-dimensional data

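Tips:

  • What %matplotlib inline is in Matplotlib and how to use it: https://blog.csdn.net/liangzuojiayi/article/details/78183783
  • load_iris loads the iris dataset bundled with sklearn (the species is predicted from the length and width of the sepals and petals). Data format: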
data.feature_names (same as data['feature_names']):
	['sepal length (cm)',
	 'sepal width (cm)',
	 'petal length (cm)',
	 'petal width (cm)']
data.target_names:
	array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
data['data']:
	array([[5.1, 3.5, 1.4, 0.2],
	       [4.9, 3. , 1.4, 0.2],
	       ......
	       [6.5, 3. , 5.2, 2. ],
	       [6.2, 3.4, 5.4, 2.3],
	       [5.9, 3. , 5.1, 1.8]])
data['target']:
	array([0, 0, 0, 0,.......2, 2, 2])
           
  • sklearn.datasets can load many other datasets as well
  • t-SNE: https://blog.csdn.net/hustqb/article/details/78144384 explains in detail how t-SNE works, its strengths and weaknesses, and how to use it
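
The load_iris fields listed above can be inspected directly. This is a minimal sketch, assuming scikit-learn is installed; the variable name `data` follows the notes above, and nothing here goes beyond the documented attributes of the returned object.

```python
from sklearn.datasets import load_iris

# Load the bundled iris dataset (150 samples, 4 features, 3 classes).
data = load_iris()

# Names of the four measured features.
print(data.feature_names)

# Names of the three species; data.target stores them as 0, 1, 2.
print(data.target_names)

# Feature matrix and label vector.
print(data.data.shape)    # (150, 4)
print(data.target.shape)  # (150,)
```

Attribute access (data.target) and key access (data['target']) return the same arrays.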

Code:

– Plot the data distribution of the handwritten digit images

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
import pandas as pd

# load the 8x8 handwritten digit images (64 features per sample)
digits = datasets.load_digits(n_class=10)
df = pd.DataFrame(digits.data)
label = digits.target
df['label'] = label
print(type(digits.data))
df
# original data
           
<class 'numpy.ndarray'>
           
0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63 label
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 6.0 13.0 10.0 0.0 0.0 0.0 0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 11.0 16.0 10.0 0.0 0.0 1
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.0 11.0 16.0 9.0 0.0 2
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 0.0 0.0 0.0 7.0 13.0 13.0 9.0 0.0 0.0 3
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 2.0 16.0 4.0 0.0 0.0 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1792 0.0 0.0 4.0 10.0 13.0 6.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 2.0 14.0 15.0 9.0 0.0 0.0 9
1793 0.0 0.0 6.0 16.0 13.0 11.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 6.0 16.0 14.0 6.0 0.0 0.0 0
1794 0.0 0.0 1.0 11.0 15.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 2.0 9.0 13.0 6.0 0.0 0.0 8
1795 0.0 0.0 2.0 10.0 7.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 5.0 12.0 16.0 12.0 0.0 0.0 9
1796 0.0 0.0 10.0 14.0 8.0 1.0 0.0 0.0 0.0 2.0 ... 0.0 0.0 1.0 8.0 12.0 14.0 12.0 1.0 0.0 8

1797 rows × 65 columns

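# The post never constructs `tsne`; the settings below are an assumption,
# so the exact coordinates will differ from the output shown.
tsne = TSNE(n_components=2, random_state=0)
result = tsne.fit_transform(digits.data)
result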
           
array([[ -4.2510934,  57.605927 ],
       [ 27.768238 , -18.912882 ],
       [ 19.440983 ,  -7.737709 ],
       ...,
       [ 10.630893 , -12.436025 ],
       [-18.820362 ,  28.899649 ],
       [  6.5873857,  -8.608063 ]], dtype=float32)
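
To check the shape of what fit_transform returns without waiting for the full dataset, the same call can be tried on a small slice. This is only a sketch: the 300-sample subset, the variable names `subset` and `embedding`, and random_state=0 are assumptions made for illustration and are not taken from the post.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

digits = datasets.load_digits(n_class=10)

# Use only the first 300 samples so the run finishes quickly (illustrative choice).
subset = digits.data[:300]

# n_components=2 requests a 2-D embedding; random_state is fixed only for repeatability.
tsne = TSNE(n_components=2, random_state=0)
embedding = tsne.fit_transform(subset)

# One 2-D point per input sample.
print(embedding.shape)  # (300, 2)
```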
           
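# draw the 2-D embedding

x_min, x_max = np.min(result), np.max(result)

# Min-max normalization: rescale every coordinate into [0, 1] using the global
# minimum and maximum, so the labels fall inside the default axes limits.
result = (result - x_min) / (x_max - x_min)
fig = plt.figure()
# subplot(111) creates a 1x1 grid and selects its first cell; the first two digits
# are the number of rows and columns, so larger values make each cell smaller.
ax = plt.subplot(111)
for i in range(result.shape[0]):
    # draw each sample as its digit label, colored by class
    plt.text(result[i, 0], result[i, 1], str(label[i]),
             color=plt.cm.Set1(label[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
plt.xticks([])
plt.yticks([])
plt.title('hello')
plt.show()  # show() displays the current figure; it does not take a figure argument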
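
The normalization step in the plotting code can be looked at in isolation. The sketch below wraps it in a helper and adds a per-axis variant for comparison; both function names and the `demo` array are hypothetical and only serve as illustration.

```python
import numpy as np

def min_max_scale(points):
    """Scale all values into [0, 1] with one global min and max, as in the post."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(), points.max()
    if hi == lo:
        # Constant input: avoid dividing by zero.
        return np.zeros_like(points)
    return (points - lo) / (hi - lo)

def min_max_scale_per_axis(points):
    """Variant (not in the post): scale each column into [0, 1] separately."""
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0  # leave constant columns at 0
    return (points - lo) / span

# Made-up 2-D points standing in for a t-SNE result.
demo = np.array([[-4.0, 58.0], [28.0, -19.0], [19.0, -8.0]])
scaled = min_max_scale(demo)
scaled_axis = min_max_scale_per_axis(demo)
print(scaled.min(), scaled.max())
print(scaled_axis.min(axis=0), scaled_axis.max(axis=0))
```

The global version keeps the aspect ratio of the embedding, while the per-axis version stretches each axis to fill the plot; which one reads better depends on the data.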
           

Result:

[Figure: 2-D t-SNE plot of the handwritten digits, each sample drawn as its digit label]