A Beginner's Guide to pandas

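The previous post covered NumPy. The other tool we will use constantly is pandas. If NumPy stores data like a list, pandas stores it more like a dictionary: every row and every column in pandas has a label, whereas a NumPy array is addressed by position only. This post introduces the basics of pandas; for more advanced usage, see the official pandas documentation.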
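
To make the list-versus-dictionary comparison concrete, here is a minimal sketch. The labels 'r1', 'r2', 'x' and 'y' are made up for this illustration and do not come from the article.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([[1, 2], [3, 4]])
top_right = arr[0, 1]

# A DataFrame carries row labels (index) and column labels (columns),
# so the same cell can be looked up by name, much like a dictionary key.
# The labels below are hypothetical.
df = pd.DataFrame(arr, index=['r1', 'r2'], columns=['x', 'y'])
same_cell = df.loc['r1', 'y']

print(top_right, same_cell)
```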

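1. Installing and importing pandas

Install: run the following command in a terminal

pip3 install pandas

Import: for brevity, pd is used as the alias for pandas. pandas depends on NumPy, which pip installs alongside it; the examples below also call NumPy directly, so import it as well.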

import numpy as np
import pandas as pd

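2. Creating pandas Series and DataFrames, and their attributes

Constructors:

pd.Series: creates a one-dimensional labeled array (Series)

pd.date_range: creates a sequence of dates (DatetimeIndex)

pd.DataFrame: creates a two-dimensional labeled table (DataFrame)

DataFrame attributes and methods

dtypes: data type of each column

index: row labels

columns: column labels

values: the underlying data as a NumPy array

describe(): summary statistics for the numeric columns

T: the transpose of the DataFrame

sort_index(axis=, ascending=): sort by labels {axis: 0 (sort by row labels), 1 (sort by column labels)} {ascending: True (ascending), False (descending)}

sort_values(by=, ascending=): sort rows by the values in a column {by: column name}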

s = pd.Series([1, 3, 6, np.nan, 23, 3])
dates = pd.date_range('20180708', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame({
  'a':pd.Series([1, 2, 3, 4]),
  'b':pd.Timestamp('20180708'),
  'c':pd.Categorical(['cate1', 'cate2', 'cate3', 'cate4'])
})
print(df2)
print(df2.dtypes)
print(df2.index)
print(df2.columns)
print(df2.values)
print(df2.describe())
print(df2.T)
print(df2.sort_index(axis=1, ascending=False))
print(df2.sort_index(axis=0, ascending=False))
print(df2.sort_values(by='a', ascending=False))
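
The difference between sorting by label and sorting by value is easier to see on a small frame whose labels are deliberately out of order. The following is a sketch with hand-picked values, not data from the article.

```python
import pandas as pd

# Small hand-made frame; labels and values are illustrative only.
df = pd.DataFrame({'b': [3, 1, 2], 'a': [9, 7, 8]}, index=[2, 0, 1])

by_row_label = df.sort_index(axis=0)                # rows reordered by row label
by_col_label = df.sort_index(axis=1)                # columns reordered by column label
by_value = df.sort_values(by='b', ascending=False)  # rows ordered by column b, largest first

print(by_row_label)
print(by_col_label)
print(by_value)
print(df.T.shape)  # transposing swaps rows and columns
```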

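3. Selecting data

.column_name: select a single column

[column_name]: select a single column

[start:end]: select rows by position, from start through end - 1

[start_label:end_label]: select rows by label, from start_label through end_label (both included)

loc[row labels, column labels]: select by row and column labels

iloc[row positions, column positions]: select by integer row and column positions

ix[row labels/positions, column labels/positions]: mixed label/position selection; deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead

[boolean expression]: select by a boolean expression; only rows for which it is True are selected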

dates = pd.date_range('20180709', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df.A)
print(df['A'])
print(df[2:3])
print(df['20180709':'20180710'])

# loc: select by label
print(df.loc['20180711'])
print(df.loc[:,['B','C']])

# iloc : select by position
print(df.iloc[1:3, 2:4])
print(df.iloc[[0, 2], 2:4])

# mixed selection: .ix was removed in pandas 1.0, so convert positions to labels first
print(df.loc[df.index[[0, 2]], ['B']])

# Boolean indexing
print(df[df.A > 3])
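
One detail that is easy to miss in the list above: position-based slices exclude the end, while label-based slices include it. A minimal sketch, reusing the same frame as the example above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20180709', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=dates, columns=['A', 'B', 'C', 'D'])

# Position-based slice: the end position is excluded, so this is one row.
by_position = df.iloc[0:1]

# Label-based slice: both end labels are included, so this is two rows.
by_label = df.loc['20180709':'20180710']

# Boolean mask: keep rows where column A is greater than 3.
masked = df[df['A'] > 3]

print(len(by_position), len(by_label), len(masked))
```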

4. Setting values

Select the data first, then assign to the selection directly to set it to the desired value.

dates = pd.date_range('20180709', periods=3)
df = pd.DataFrame(np.arange(12).reshape((3, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
df.loc['20180709', 'B'] = 666
df.iloc[2, 2] = 999
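df.loc['20180709', df.columns[3]] = 777  # .ix was removed in pandas 1.0; look up the column label by position
df.loc[df.A > 3, 'A'] = 888  # one .loc call instead of chained assignment, which may not modify df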
df['F'] = np.nan
print(df)
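
A condition on one column can also drive an assignment to other columns, as long as the row mask and the column list go into the same .loc call. A minimal sketch, using a default integer index instead of dates for simplicity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=['A', 'B', 'C', 'D'])

# Column A holds 0, 4, 8, so the mask picks the last two rows.
mask = df['A'] > 3

# Row selection and column selection happen in a single .loc call,
# so the assignment is applied to df itself.
df.loc[mask, ['B', 'C']] = 0

print(df)
```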

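5. Handling NaN values

dropna(axis=, how=): drop NaN data {axis: 0 (drop rows), 1 (drop columns)} {how: 'any' (drop if any value is NaN), 'all' (drop only if every value is NaN)}

fillna(value=): replace every NaN with value

isnull(): test each element for NaN and return a boolean DataFrame of the same shape

np.any(matrix): True if at least one element of matrix is True; applied to the result of isnull(), it tells whether any NaN exists

np.all(matrix): True only if every element of matrix is True; applied to the result of isnull(), it tells whether every value is NaN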

dates = pd.date_range('20180709', periods=5)
df = pd.DataFrame(np.arange(20, dtype=float).reshape((5, 4)), index=dates, columns=['A', 'B', 'C', 'D'])  # float dtype so a NaN can be stored without changing the column type
df.iloc[3, 3] = np.nan
print(df.dropna(axis=1, how='all')) # how = {'any', 'all'}
print(df.fillna(value=666))
print(df.isnull())
print(df.isnull().values.any())  # True: at least one value is NaN
print(df.isnull().values.all())  # False: not every value is NaN
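
The example above contains a single NaN, so how='any' and how='all' are hard to tell apart. The sketch below uses a hand-made frame (values chosen only for illustration) where the two settings give different results.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [np.nan, np.nan, 6.0],
    'C': [np.nan, np.nan, np.nan],
})

rows_any = df.dropna(axis=0, how='any')  # every row has a NaN (column C), so all rows go
rows_all = df.dropna(axis=0, how='all')  # only row 1 is entirely NaN
cols_all = df.dropna(axis=1, how='all')  # only column C is entirely NaN
filled = df.fillna(value=0)

has_nan = df.isnull().values.any()  # at least one cell is NaN
all_nan = df.isnull().values.all()  # every cell is NaN

print(len(rows_any), list(rows_all.index), list(cols_all.columns), has_nan, all_nan)
```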

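6. Reading and exporting data

pandas provides read and export functions for many data formats, for example:

Reading: read_csv, read_table, read_fwf, read_clipboard, read_excel, read_hdf

Exporting: to_csv, to_clipboard, to_excel, to_hdf (read_table and read_fwf have no to_table or to_fwf counterpart)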

df = pd.read_csv('Q1.csv')
print(df)
df.to_csv('Q1_pandas.csv')
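
The file Q1.csv is not included with the article, so here is a sketch of the same round trip that runs without any file on disk. It assumes an in-memory io.StringIO buffer in place of a file path, and the column names are invented for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})

# index=False keeps the row labels out of the output, so reading it back
# does not produce an extra unnamed column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read it back as if it were a CSV file.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```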

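7. Combining data

The concat function

First argument: a list of the DataFrames to combine

axis: the axis to combine along; 0: stack rows, 1: place side by side as columns

join: how to handle columns/rows not shared by all inputs; inner: drop them, outer: keep them and fill the gaps with NaN

ignore_index: whether to discard the original labels along the concatenation axis and renumber from 0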

df1 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'], index=[0, 1, 2])
df2 = pd.DataFrame(np.ones((3, 4)), columns=['B', 'C', 'D', 'E'], index=[1, 2, 3])

print(pd.concat([df1, df2], join='outer', ignore_index=True)) # join = {'outer', 'inner'}
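# join_axes was removed in pandas 1.0; reindex to df1's index instead
print(pd.concat([df1, df2], axis=1).reindex(df1.index))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
print(pd.concat([df1, df2], ignore_index=True))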
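
To compare the two join modes directly, the sketch below builds the same two frames as above and checks the shape of each result.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'], index=[0, 1, 2])
df2 = pd.DataFrame(np.ones((3, 4)), columns=['B', 'C', 'D', 'E'], index=[1, 2, 3])

# outer: keep the union of columns (A..E); missing cells become NaN.
outer = pd.concat([df1, df2], join='outer', ignore_index=True)

# inner: keep only the columns both frames share (B, C, D).
inner = pd.concat([df1, df2], join='inner', ignore_index=True)

# Side by side, then restricted to df1's row labels.
side_by_side = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(outer.shape, inner.shape, side_by_side.shape)
```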

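The merge function

First and second arguments: the two DataFrames to merge

on: name(s) of the key column(s) present in both DataFrames

how: how to handle keys that do not appear on both sides; inner: keep only keys found in both, outer: keep all keys and fill missing values with NaN, left: keep every row of the left DataFrame and fill missing values with NaN, right: keep every row of the right DataFrame and fill missing values with NaN

indicator: whether to add a _merge column showing where each row came from (left_only, right_only, or both)

suffixes: suffixes appended to non-key column names that appear in both DataFrames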

df1 = pd.DataFrame({
  'key':['K1', 'K2', 'K3'],
  'A':['A1', 'A2', 'A3'],
  'B':['B1', 'B2', 'B3']
})
df2 = pd.DataFrame({
  'key':['K1', 'K2', 'K3'],
  'C':['C1', 'C2', 'C3'],
  'D':['D1', 'D2', 'D3']
})
print(pd.merge(df1, df2, on='key'))
df3 = pd.DataFrame({
  'key1':['K1', 'K1', 'K0'],
  'key2':['K1', 'K0', 'K1'],
  'col':[1, 2, 3]
})
df4 = pd.DataFrame({
  'key1':['K0', 'K1', 'K0'],
  'key2':['K1', 'K0', 'K0'],
  'col':[6, 7, 8]
})
# how = {'inner', 'outer', 'left', 'right'}
print(pd.merge(df3, df4, on=['key1', 'key2'], how='right', suffixes=['_left', '_right'], indicator=True))
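
The effect of how is easiest to see by running all four options on the same pair of frames. This is a sketch with two small hypothetical frames (left and right) whose keys only partly overlap.

```python
import pandas as pd

# K1 and K2 appear on both sides; K0 only on the left, K3 only on the right.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'val': [1, 2, 3]})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'val': [20, 30, 40]})

results = {}
for how in ['inner', 'outer', 'left', 'right']:
    # 'val' exists in both frames, so it gets the suffixes;
    # indicator=True adds the _merge column.
    results[how] = pd.merge(left, right, on='key', how=how,
                            suffixes=('_left', '_right'), indicator=True)
    print(how, len(results[how]), sorted(results[how]['key']))
```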

8. Visualizing data

pandas plotting relies on the matplotlib library, so import it before plotting

import matplotlib.pyplot as plt

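First, generate four columns of random data with np.random.randn

Then accumulate the random data with cumsum

Next, use the scatter method to plot two of the columns as the X and Y values of the green points and the other two as the X and Y values of the blue points

Finally, call plt.show() to display the figure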

data = pd.DataFrame(np.random.randn(1000, 4),
  index=np.arange(1000),
  columns=list("ABCD"))
data = data.cumsum()
# plot methods:
# 'bar', 'hist', 'box', 'kde', 'area', 'scatter', 'hexbin', 'pie'
ax = data.plot.scatter(x='A', y='B', color='blue', label='class 1')
data.plot.scatter(x='C', y='D', color='green', label='class 2', ax=ax)
plt.show()
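
Because the plotted data is random, the cumsum step is hard to follow from the figure alone. The sketch below applies it to a few hand-picked numbers (not the article's data) so the running totals can be checked by eye.

```python
import pandas as pd

# Each row is one step; cumsum turns the steps into running totals per column.
steps = pd.DataFrame({'A': [1, -2, 3], 'B': [0.5, 0.5, 0.5]})
walk = steps.cumsum()

print(walk)
```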
