Getting Started with pandas (Part 1)

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

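pandas is a flexible and powerful toolkit for data processing and data analysis. It provides a high-level wrapper around NumPy (a high-performance N-dimensional array library), Matplotlib (a visualization tool), file I/O and more, and is widely used for data cleaning, data analysis and data mining.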

Official site: https://pandas.pydata.org/

Documentation: https://pandas.pydata.org/docs/

If NumPy is completely new to you, consider reading the previous article first:

https://www.cnblogs.com/bytesfly/p/numpy.html

In [1]:

# Import the modules that are used frequently below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Alternatively, call pd.show_versions()
pd.__version__

Out[1]:

'1.1.3'

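Readers who are new to Python may not know that the help() function displays the documentation of any object at any time, so it is worth a quick mention here: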

In [2]:

# help(np.random)
# help(pd)
# help(plt)
# help(pd.DataFrame)

# help() also accepts a method of an instance object
# df = pd.DataFrame(np.random.randint(50, 100, (6, 5)))
# help(df.to_csv)

pandas data structures

Related documentation:

https://pandas.pydata.org/docs/user_guide/dsintro.html

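pandas has two main data structures: Series (one-dimensional) and DataFrame (two-dimensional, tabular). In addition, MultiIndex is a hierarchical index that lets a Series or DataFrame represent data with more than two dimensions.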

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

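A Series is a data structure similar to a one-dimensional array. It consists of two parts: a set of values (data) and the index labels associated with them (index), as shown in the figure below:

Here is how to create a Series:

Without specifying an index, the default index (0 to N-1) is used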

In [3]:

s1 = pd.Series([12, 4, 7, 9])
s1

Out[3]:

0    12
1     4
2     7
3     9
dtype: int64

Access data by index label:

In [4]:

s1[0]

Out[4]:

12

In [5]:

s1[3]

Out[5]:

9

Specifying an index

In [6]:

s2 = pd.Series([12, 4, 7, 9], index=["a", "b", "c", "d"])
s2

Out[6]:

a    12
b     4
c     7
d     9
dtype: int64

Access data by index label:

In [7]:

s2['a']

Out[7]:

12

In [8]:

s2['d']

Out[8]:

9

Series can be instantiated from dicts

In [9]:

s3 = pd.Series({"d": 12, "c": 4, "b": 7, "a": 9})
s3

Out[9]:

d    12
c     4
b     7
a     9
dtype: int64

When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion order, if you’re using Python version >= 3.6 and pandas version >= 0.23.

Access data by index label:

In [10]:

s3['d']

Out[10]:

12

In [11]:

s3['a']

Out[11]:

9

A Series also exposes two attributes, index and values:

In [12]:

s1.index

Out[12]:

RangeIndex(start=0, stop=4, step=1)

In [13]:

s2.index

Out[13]:

Index(['a', 'b', 'c', 'd'], dtype='object')

In [14]:

s3.index

Out[14]:

Index(['d', 'c', 'b', 'a'], dtype='object')

In [15]:

s3.values

Out[15]:

array([12,  4,  7,  9])

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [16]:

pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

Out[16]:

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

NaN (not a number) is the standard missing data marker used in pandas.

Note: NaN here is a marker for a missing value.
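Since NaN marks missing data, it helps to know how to find and handle it. The following is a minimal sketch that rebuilds the Series from the cell above; the variable names (s, missing, filled, dropped) and the fill value 0.0 are illustrative choices, not part of the original post.

```python
import pandas as pd

# Same construction as above: label "d" has no value in the dict
s = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

missing = s.isna()       # boolean mask, True where the value is NaN
filled = s.fillna(0.0)   # new Series with NaN replaced by 0.0
dropped = s.dropna()     # new Series without the NaN entries

print(missing)
print(filled)
print(dropped)
```

None of these calls modifies s itself; each returns a new Series.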

If data is a scalar, an index must be passed:

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [17]:

pd.Series(5.0, index=["c", "a", "b"])

Out[17]:

c    5.0
a    5.0
b    5.0
dtype: float64

The construction above is equivalent to:

In [18]:

pd.Series([5.0, 5.0, 5.0], index=["c", "a", "b"])

Out[18]:

c    5.0
a    5.0
b    5.0
dtype: float64

Some ndarray operations also work on a Series:

Series is ndarray-like. Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing will also slice the index.

In [19]:

# Take the first 3 elements of s1
s1[:3]

Out[19]:

0    12
1     4
2     7
dtype: int64

In [20]:

# Which elements are greater than 10
s1 > 10

Out[20]:

0     True
1    False
2    False
3    False
dtype: bool
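A boolean Series like the one above is usually used as a mask to filter values. Here is a small sketch under the same data; the names mask, large, doubled and roots are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([12, 4, 7, 9])

mask = s1 > 10        # boolean Series aligned on the same index
large = s1[mask]      # keeps only the elements where mask is True
doubled = s1 * 2      # element-wise arithmetic, as with an ndarray
roots = np.sqrt(s1)   # a NumPy ufunc returns a Series with the same index

print(large)
print(doubled)
print(roots)
```

Note that filtering keeps the original index labels, so large still carries label 0.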

For more operations, see the explanation of NumPy's ndarray in the previous article:

https://www.cnblogs.com/bytesfly/p/numpy.html

In [21]:

s1.dtype

Out[21]:

dtype('int64')

If you need the actual array backing a Series, use Series.array

In [22]:

# Convert the Series to a pandas array
s1.array

Out[22]:

<PandasArray>
[12, 4, 7, 9]
Length: 4, dtype: int64

While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy()

In [23]:

# Convert the Series to an ndarray
s1.to_numpy()

Out[23]:

array([12,  4,  7,  9])

Some dict operations also work on a Series:

Series is dict-like. A Series is like a fixed-size dict in that you can get and set values by index label

In [24]:

s3['a'] = 100
s3

Out[24]:

d     12
c      4
b      7
a    100
dtype: int64

In [25]:

"a" in s3

Out[25]:

True

In [26]:

"y" in s3

Out[26]:

False

In [27]:

# Return NaN as the default when the label is missing
s3.get("y", np.nan)

Out[27]:

nan

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

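A DataFrame is an object similar to a two-dimensional array or table, with both a row index and a column index, as shown in the figure below:

Row index (also called row labels): identifies each row; it is named index and corresponds to axis 0 (axis=0)
Column index (also called column labels): identifies each column; it is named columns and corresponds to axis 1 (axis=1)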
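The meaning of axis=0 and axis=1 is easiest to see with an aggregation. The following sketch uses a tiny hypothetical DataFrame (the labels r1, r2, a, b, c are made up for illustration).

```python
import pandas as pd

demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                    index=["r1", "r2"], columns=["a", "b", "c"])

# axis=0 collapses the rows: one result per column
by_column = demo.sum(axis=0)

# axis=1 collapses the columns: one result per row
by_row = demo.sum(axis=1)

print(by_column)
print(by_row)
```

In other words, the axis argument names the direction that is reduced away, not the direction that remains.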

Here is how to create a DataFrame:

Without specifying row or column labels, the default index (0 to N-1) is used

In [28]:

# Randomly generate scores for 6 students in 5 courses
score = np.random.randint(50, 100, (6, 5))

# Create the DataFrame
pd.DataFrame(score)

Out[28]:

0	1	2	3	4
0	69	90	56	97	79
1	57	98	70	57	82
2	63	66	98	78	63
3	74	58	75	57	68
4	94	78	83	72	73
5	60	73	62	72	79

Specifying row and column labels

In [29]:

# Column labels
subjects = ["國文", "數學", "英語", "實體", "化學"]

# Row labels
stus = ['學生' + str(i+1) for i in range(score.shape[0])]

# Create the DataFrame
score_df = pd.DataFrame(score, columns=subjects, index=stus)

score_df

Out[29]:

國文	數學	英語	實體	化學
學生1	69	90	56	97	79
學生2	57	98	70	57	82
學生3	63	66	98	78	63
學生4	74	58	75	57	68
學生5	94	78	83	72	73
學生6	60	73	62	72	79

With row and column labels added, the data is clearly much easier to read.

Next, a few basic attributes of a DataFrame:

In [30]:

score_df.shape

Out[30]:

(6, 5)

In [31]:

score_df.columns

Out[31]:

Index(['國文', '數學', '英語', '實體', '化學'], dtype='object')

In [32]:

score_df.index

Out[32]:

Index(['學生1', '學生2', '學生3', '學生4', '學生5', '學生6'], dtype='object')

In [33]:

score_df.values

Out[33]:

array([[69, 90, 56, 97, 79],
       [57, 98, 70, 57, 82],
       [63, 66, 98, 78, 63],
       [74, 58, 75, 57, 68],
       [94, 78, 83, 72, 73],
       [60, 73, 62, 72, 79]])

In [34]:

# Transpose
score_df.T

Out[34]:

學生1	學生2	學生3	學生4	學生5	學生6
國文	69	57	63	74	94	60
數學	90	98	66	58	78	73
英語	56	70	98	75	83	62
實體	97	57	78	57	72	72
化學	79	82	63	68	73	79

In [35]:

# Show the first 3 rows
score_df.head(3)

Out[35]:

國文	數學	英語	實體	化學
學生1	69	90	56	97	79
學生2	57	98	70	57	82
學生3	63	66	98	78	63

In [36]:

# Show the last 3 rows
score_df.tail(3)

Out[36]:

國文	數學	英語	實體	化學
學生4	74	58	75	57	68
學生5	94	78	83	72	73
學生6	60	73	62	72	79

Changing the row labels:

In [37]:

stus = ['stu' + str(i+1) for i in range(score.shape[0])]

score_df.index = stus

score_df

Out[37]:

國文	數學	英語	實體	化學
stu1	69	90	56	97	79
stu2	57	98	70	57	82
stu3	63	66	98	78	63
stu4	74	58	75	57	68
stu5	94	78	83	72	73
stu6	60	73	62	72	79

Resetting the row labels:

In [38]:

# drop defaults to False, so the old index is kept as a column
score_df.reset_index()

Out[38]:

index	國文	數學	英語	實體	化學
0	stu1	69	90	56	97	79
1	stu2	57	98	70	57	82
2	stu3	63	66	98	78	63
3	stu4	74	58	75	57	68
4	stu5	94	78	83	72	73
5	stu6	60	73	62	72	79

In [39]:

# With drop=True, the old index is discarded
score_df.reset_index(drop=True)

Out[39]:

國文	數學	英語	實體	化學
0	69	90	56	97	79
1	57	98	70	57	82
2	63	66	98	78	63
3	74	58	75	57	68
4	94	78	83	72	73
5	60	73	62	72	79

In [40]:

score_df

Out[40]:

國文	數學	英語	實體	化學
stu1	69	90	56	97	79
stu2	57	98	70	57	82
stu3	63	66	98	78	63
stu4	74	58	75	57	68
stu5	94	78	83	72	73
stu6	60	73	62	72	79

Setting a column as the new index:

set_index(keys, drop=True)

keys : a column label or a list of column labels
drop : boolean, default True. Whether to delete the columns that are used as the new index

In [41]:

hero_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                        'name': ['李尋歡', '令狐沖', '張無忌', '郭靖', '花無缺'],
                        'book': ['多情劍客無情劍', '笑傲江湖', '倚天屠龍記', '射雕英雄傳', '絕代雙驕'],
                        'skill': ['小李飛刀', '獨孤九劍', '九陽神功', '降龍十八掌', '移花接玉']})

hero_df.set_index('id')

Out[41]:

name	book	skill
id
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌
5	花無缺	絕代雙驕	移花接玉
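The effect of the drop parameter described above can be checked with a short sketch. The two-row DataFrame below is hypothetical and only serves to compare the resulting columns.

```python
import pandas as pd

demo = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})

# Default drop=True: "id" moves into the index and disappears from the columns
dropped = demo.set_index("id")

# drop=False: "id" becomes the index and also stays as a regular column
kept = demo.set_index("id", drop=False)

print(list(dropped.columns))
print(list(kept.columns))
```

In both cases set_index returns a new DataFrame, so demo itself is left unchanged unless the result is assigned back.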

Setting multiple index levels, using id and name:

In [42]:

df = hero_df.set_index(['id', 'name'])
df

Out[42]:

book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌
5	花無缺	絕代雙驕	移花接玉

In [43]:

df.index

Out[43]:

MultiIndex([(1, '李尋歡'),
            (2, '令狐沖'),
            (3, '張無忌'),
            (4,  '郭靖'),
            (5, '花無缺')],
           names=['id', 'name'])

At this point, df is a DataFrame with a MultiIndex.

MultiIndex

Related documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

In [44]:

df.index.names

Out[44]:

FrozenList(['id', 'name'])

In [45]:

df.index.levels

Out[45]:

FrozenList([[1, 2, 3, 4, 5], ['令狐沖', '張無忌', '李尋歡', '花無缺', '郭靖']])
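Beyond inspecting names and levels, a MultiIndex is mostly useful for selecting data by one or several levels. The sketch below is an assumption-laden illustration with a made-up frame (labels "a", "b", "c" and the score column are hypothetical); it uses MultiIndex.from_tuples, .loc with a full key, and xs for a single level.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, "a"), (2, "b"), (3, "c")],
                                names=["id", "name"])
demo = pd.DataFrame({"score": [80, 96, 86]}, index=idx)

# Full key (one label per level) plus a column label gives a scalar
picked = demo.loc[(2, "b"), "score"]

# xs selects on a single level; that level is dropped from the result
by_name = demo.xs("c", level="name")

print(picked)
print(by_name)
```

After xs, the remaining index of by_name is the id level only.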

Data operations and computation

In [46]:

df

Out[46]:

book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌
5	花無缺	絕代雙驕	移花接玉

Indexing

Indexing directly with [] (column first, then row):

In [47]:

df['skill']

Out[47]:

id  name
1   李尋歡      小李飛刀
2   令狐沖      獨孤九劍
3   張無忌      九陽神功
4   郭靖      降龍十八掌
5   花無缺      移花接玉
Name: skill, dtype: object

In [48]:

df['skill'][1]

Out[48]:

name
李尋歡    小李飛刀
Name: skill, dtype: object

In [49]:

df['skill'][1]['李尋歡']

Out[49]:

'小李飛刀'

Using loc (select by row and column labels)

In [50]:

df.loc[1:3]

Out[50]:

book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功

In [51]:

df.loc[(2, '令狐沖'):(4, '郭靖')]

Out[51]:

book	skill
id	name
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功
4	郭靖	射雕英雄傳	降龍十八掌

In [52]:

df.loc[1:3, 'book']

Out[52]:

id  name
1   李尋歡     多情劍客無情劍
2   令狐沖        笑傲江湖
3   張無忌       倚天屠龍記
Name: book, dtype: object

In [53]:

df.loc[df.index[1:3], ['book', 'skill']]

Out[53]:

book	skill
id	name
2	令狐沖	笑傲江湖	獨孤九劍
3	張無忌	倚天屠龍記	九陽神功

Using iloc (select by integer position)

In [54]:

# Get the first 2 rows
df.iloc[:2]

Out[54]:

book	skill
id	name
1	李尋歡	多情劍客無情劍	小李飛刀
2	令狐沖	笑傲江湖	獨孤九劍

In [55]:

df.iloc[0:2, df.columns.get_indexer(['skill'])]

Out[55]:

skill
id	name
1	李尋歡	小李飛刀
2	令狐沖	獨孤九劍
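One difference between loc and iloc is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch with a hypothetical four-row frame:

```python
import pandas as pd

demo = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Label-based slice: both "a" and "c" are included
by_label = demo.loc["a":"c"]

# Position-based slice: position 2 is excluded, as with Python lists
by_position = demo.iloc[0:2]

print(by_label)
print(by_position)
```

This is why df.loc[1:3] above returns three rows while df.iloc[:2] returns two.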

Assignment

In [56]:

# Add new columns
df['score'] = 100
df['gender'] = 'male'

df

Out[56]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	100	male
2	令狐沖	笑傲江湖	獨孤九劍	100	male
3	張無忌	倚天屠龍記	九陽神功	100	male
4	郭靖	射雕英雄傳	降龍十八掌	100	male
5	花無缺	絕代雙驕	移花接玉	100	male

In [57]:

# Modify the column values
df['score'] = 99

df

Out[57]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	99	male
2	令狐沖	笑傲江湖	獨孤九劍	99	male
3	張無忌	倚天屠龍記	九陽神功	99	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	99	male

In [58]:

# The column can also be modified through attribute access
df.score = 100

df

Out[58]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	100	male
2	令狐沖	笑傲江湖	獨孤九劍	100	male
3	張無忌	倚天屠龍記	九陽神功	100	male
4	郭靖	射雕英雄傳	降龍十八掌	100	male
5	花無缺	絕代雙驕	移花接玉	100	male

Sorting

sort_index

In [59]:

# Sort by index in descending order
df.sort_index(ascending=False)

Out[59]:

book	skill	score	gender
id	name
5	花無缺	絕代雙驕	移花接玉	100	male
4	郭靖	射雕英雄傳	降龍十八掌	100	male
3	張無忌	倚天屠龍記	九陽神功	100	male
2	令狐沖	笑傲江湖	獨孤九劍	100	male
1	李尋歡	多情劍客無情劍	小李飛刀	100	male

sort_values

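First, set score to different values:

In [60]:

# Assign through a single .loc call; chained indexing such as
# df['score'][1]['李尋歡'] = 80 may write to a temporary copy
df.loc[(1, '李尋歡'), 'score'] = 80
df.loc[(2, '令狐沖'), 'score'] = 96
df.loc[(3, '張無忌'), 'score'] = 86
df.loc[(4, '郭靖'), 'score'] = 99
df.loc[(5, '花無缺'), 'score'] = 95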

df

Out[60]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
3	張無忌	倚天屠龍記	九陽神功	86	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	95	male

In [61]:

# Sort by score in descending order
df.sort_values(by='score', ascending=False)

Out[61]:

book	skill	score	gender
id	name
4	郭靖	射雕英雄傳	降龍十八掌	99	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
5	花無缺	絕代雙驕	移花接玉	95	male
3	張無忌	倚天屠龍記	九陽神功	86	male
1	李尋歡	多情劍客無情劍	小李飛刀	80	male

In [62]:

# Sort by the string length of the book title, ascending
df.sort_values(by='book', key=lambda col: col.str.len())

Out[62]:

book	skill	score	gender
id	name
2	令狐沖	笑傲江湖	獨孤九劍	96	male
5	花無缺	絕代雙驕	移花接玉	95	male
3	張無忌	倚天屠龍記	九陽神功	86	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
1	李尋歡	多情劍客無情劍	小李飛刀	80	male

Arithmetic operations

In [63]:

# score+1
df['score'].add(1)

Out[63]:

id  name
1   李尋歡      81
2   令狐沖      97
3   張無忌      87
4   郭靖      100
5   花無缺      96
Name: score, dtype: int64

In [64]:

# score-1
df['score'].sub(1)

Out[64]:

id  name
1   李尋歡     79
2   令狐沖     95
3   張無忌     85
4   郭靖      98
5   花無缺     94
Name: score, dtype: int64

In [65]:

# Or use operators such as + - * / // % directly
(df['score'] + 1) % 10

Out[65]:

id  name
1   李尋歡     1
2   令狐沖     7
3   張無忌     7
4   郭靖      0
5   花無缺     6
Name: score, dtype: int64

Logical operations

First, a reminder of the data:

In [66]:

df

Out[66]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
3	張無忌	倚天屠龍記	九陽神功	86	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	95	male

Result of a logical operation:

In [67]:

df['score'] > 90

Out[67]:

id  name
1   李尋歡     False
2   令狐沖      True
3   張無忌     False
4   郭靖       True
5   花無缺      True
Name: score, dtype: bool

In [68]:

# Select the rows with a score greater than 90
df[df['score'] > 90]

Out[68]:

book	skill	score	gender
id	name
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
5	花無缺	絕代雙驕	移花接玉	95	male

In [69]:

# Select the rows with a score between 85 and 90 (both exclusive)
df[(df['score'] > 85) & (df['score'] < 90)]

Out[69]:

book	skill	score	gender
id	name
3	張無忌	倚天屠龍記	九陽神功	86	male

In [70]:

# Select the rows with a score below 85 or above 95
df[(df['score'] < 85) | (df['score'] > 95)]

Out[70]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male

The same filter can be written with the query() method:

In [71]:

df.query("score<85 | score>95")

Out[71]:

book	skill	score	gender
id	name
1	李尋歡	多情劍客無情劍	小李飛刀	80	male
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male
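When the thresholds live in Python variables, query() can reference them with the @ prefix. A small sketch, assuming a hypothetical flat frame named scores and thresholds low and high:

```python
import pandas as pd

scores = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                       "score": [80, 96, 86, 99, 95]})

low, high = 85, 95

# @low and @high refer to the local Python variables
picked = scores.query("score < @low or score > @high")

# Equivalent boolean-mask form
same = scores[(scores["score"] < low) | (scores["score"] > high)]

print(picked)
```

Both forms select the same rows; query() tends to read better when the condition gets long.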

In addition, isin(values) selects the rows whose value appears in a given list, similar to an IN query in SQL:

In [72]:

df[df["score"].isin([99, 96])]

Out[72]:

book	skill	score	gender
id	name
2	令狐沖	笑傲江湖	獨孤九劍	96	male
4	郭靖	射雕英雄傳	降龍十八掌	99	male

Statistical operations

describe() produces many summary statistics at once, including count, mean, std, min and max:

In [73]:

df.describe()

Out[73]:

score
count	5.000000
mean	91.200000
std	7.918333
min	80.000000
25%	86.000000
50%	95.000000
75%	96.000000
max	99.000000

In [74]:

# Aggregation functions: axis=0 gives one result per column, axis=1 one result per row
df.max(axis=0, numeric_only=True)

Out[74]:

score    99
dtype: int64

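The other common aggregation functions work similarly, so they are not shown one by one.

The cumulative functions deserve a closer look.

Function	Description
cumsum	Sum of the first 1/2/3/…/n values
cummax	Maximum of the first 1/2/3/…/n values
cummin	Minimum of the first 1/2/3/…/n values
cumprod	Product of the first 1/2/3/…/n values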
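To make the table concrete, here is a tiny sketch that applies all four functions to the same hypothetical Series [3, 1, 4, 2] and collects the results side by side.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2])

# One column per cumulative function, aligned on the original index
result = pd.DataFrame({
    "value": s,
    "cumsum": s.cumsum(),
    "cummax": s.cummax(),
    "cummin": s.cummin(),
    "cumprod": s.cumprod(),
})

print(result)
```

Each output has the same length as the input: position i holds the statistic of elements 0 through i.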

Below is the revenue of each department of a company over the past six months:

In [75]:

income = pd.DataFrame(data=np.random.randint(60, 100, (6, 5)),
                      columns=['group' + str(x) for x in range(1, 6)],
                      index=['Month' + str(x) for x in range(1, 7)])

income

Out[75]:

group1	group2	group3	group4	group5
Month1	97	89	62	82	71
Month2	68	69	82	66	79
Month3	77	87	66	94	82
Month4	69	76	99	79	61
Month5	77	94	76	70	70
Month6	89	64	92	63	60

Cumulative total revenue of group1 over the first N months:

In [76]:

group1_income = income['group1']

group1_income.cumsum()

Out[76]:

Month1     97
Month2    165
Month3    242
Month4    311
Month5    388
Month6    477
Name: group1, dtype: int64

A plot makes this more intuitive:

In [77]:

group1_income.cumsum().plot(figsize=(8, 5))

plt.show()

Similarly, the highest monthly revenue of group1 over the first N months:

In [78]:

group1_income.cummax().plot(figsize=(8, 5))

plt.show()

Other operations

First, the revenue of the first 3 departments over the past six months:

In [79]:

income[['group1', 'group2', 'group3']]

Out[79]:

group1	group2	group3
Month1	97	89	62
Month2	68	69	82
Month3	77	87	66
Month4	69	76	99
Month5	77	94	76
Month6	89	64	92

In [80]:

# Range (max - min) of revenue across the first 3 departments, for each month
income[['group1', 'group2', 'group3']].apply(lambda x: x.max() - x.min(), axis=1)

Out[80]:

Month1    35
Month2    14
Month3    21
Month4    30
Month5    18
Month6    28
dtype: int64

In [81]:

# Range (max - min) of revenue over the six months, for each of the first 3 departments
income[['group1', 'group2', 'group3']].apply(lambda x: x.max() - x.min(), axis=0)

Out[81]:

group1    29
group2    30
group3    37
dtype: int64
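The same ranges can also be computed without apply, by subtracting the built-in min from max along the chosen axis. The sketch below retypes the three columns shown above into a frame named income3 (a name chosen here for illustration) so that it runs on its own.

```python
import pandas as pd

income3 = pd.DataFrame(
    {"group1": [97, 68, 77, 69, 77, 89],
     "group2": [89, 69, 87, 76, 94, 64],
     "group3": [62, 82, 66, 99, 76, 92]},
    index=[f"Month{i}" for i in range(1, 7)],
)

# Vectorized: max and min along an axis, then subtract
per_month = income3.max(axis=1) - income3.min(axis=1)
per_group = income3.max(axis=0) - income3.min(axis=0)

# The apply-based version from the cells above, for comparison
via_apply = income3.apply(lambda x: x.max() - x.min(), axis=1)

print(per_month)
print(per_group)
```

On large frames the vectorized form is usually faster, since apply with a Python lambda calls the function once per row or column.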

Reading and writing files

Related documentation:

https://pandas.pydata.org/docs/user_guide/io.html

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.to_csv(). Below is a table containing available readers and writers.

Format Type	Data Description	Reader	Writer
text	CSV	read_csv	to_csv
text	Fixed-Width Text File	read_fwf
text	JSON	read_json	to_json
text	HTML	read_html	to_html
text	Local clipboard	read_clipboard	to_clipboard
binary	MS Excel	read_excel	to_excel
binary	OpenDocument	read_excel
binary	HDF5 Format	read_hdf	to_hdf
binary	Feather Format	read_feather	to_feather
binary	Parquet Format	read_parquet	to_parquet
binary	ORC Format	read_orc
binary	Stata	read_stata	to_stata
binary	SAS	read_sas
binary	SPSS	read_spss
binary	Python Pickle Format	read_pickle	to_pickle
SQL	SQL	read_sql	to_sql
SQL	Google BigQuery	read_gbq	to_gbq

CSV

The CSV data from the following site is used for some tests:

https://www.stats.govt.nz/large-datasets/csv-files-for-download/

In [82]:

path = "https://www.stats.govt.nz/assets/Uploads/Employment-indicators/Employment-indicators-Weekly-as-at-24-May-2021/Download-data/Employment-indicators-weekly-paid-jobs-20-days-as-at-24-May-2021.csv"
# Read CSV
data = pd.read_csv(path, sep=',', usecols=['Week_end', 'High_industry', 'Value'])
data

Out[82]:

Week_end	High_industry	Value
0	2019-05-05	Total	1828160.00
1	2019-05-05	A Primary	79880.00
2	2019-05-05	B Goods Producing	344320.00
3	2019-05-05	C Services	1389220.00
4	2019-05-05	Z No Match	14730.00
...	...	...	...
2095	2021-05-02	Total	700.94
2096	2021-05-02	A Primary	680.20
2097	2021-05-02	B Goods Producing	916.42
2098	2021-05-02	C Services	649.92
2099	2021-05-02	Z No Match	425.65

2100 rows × 3 columns

In [83]:

# Write CSV
data[:20].to_csv("./test.csv", columns=['Week_end', 'Value'],
                 header=True, index=False, mode='w')

The beginning of the output file looks like this:

Week_end,Value
2019-05-05,1828160.0
2019-05-05,79880.0
2019-05-05,344320.0

JSON

For more sample data in JSON format, search Google for site:api.androidhive.info

In [84]:

path = "https://api.androidhive.info/json/movies.json"
# Read JSON
data = pd.read_json(path)
data = data.loc[:2, ['title', 'rating']]
data

Out[84]:

title	rating
0	Dawn of the Planet of the Apes	8.3
1	District 9	8.0
2	Transformers: Age of Extinction	6.3

records

In [85]:

data.to_json("./test.json", orient='records', lines=True)

The output looks like this:

{"title":"Dawn of the Planet of the Apes","rating":8.3}
{"title":"District 9","rating":8.0}
{"title":"Transformers: Age of Extinction","rating":6.3}

With lines=False, that is:

In [86]:

data.to_json("./test.json", orient='records', lines=False)

The output looks like this:

[
    {
        "title":"Dawn of the Planet of the Apes",
        "rating":8.3
    },
    {
        "title":"District 9",
        "rating":8.0
    },
    {
        "title":"Transformers: Age of Extinction",
        "rating":6.3
    }
]

columns

In [87]:

data.to_json("./test.json", orient='columns')

The output looks like this:

{
    "title":{
        "0":"Dawn of the Planet of the Apes",
        "1":"District 9",
        "2":"Transformers: Age of Extinction"
    },
    "rating":{
        "0":8.3,
        "1":8.0,
        "2":6.3
    }
}

index

In [88]:

data.to_json("./test.json", orient='index')

The output looks like this:

{
    "0":{
        "title":"Dawn of the Planet of the Apes",
        "rating":8.3
    },
    "1":{
        "title":"District 9",
        "rating":8.0
    },
    "2":{
        "title":"Transformers: Age of Extinction",
        "rating":6.3
    }
}

split

In [89]:

data.to_json("./test.json", orient='split')

The output looks like this:

{
    "columns":[
        "title",
        "rating"
    ],
    "index":[
        0,
        1,
        2
    ],
    "data":[
        [
            "Dawn of the Planet of the Apes",
            8.3
        ],
        [
            "District 9",
            8.0
        ],
        [
            "Transformers: Age of Extinction",
            6.3
        ]
    ]
}

values

In [90]:

data.to_json("./test.json", orient='values')

The output looks like this:

[
    [
        "Dawn of the Planet of the Apes",
        8.3
    ],
    [
        "District 9",
        8.0
    ],
    [
        "Transformers: Age of Extinction",
        6.3
    ]
]
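The orient values can also be compared without touching the disk: when no path is given, to_json returns the JSON text as a string. The sketch below parses that string with the standard json module; the two-row frame is a cut-down copy of the data above.

```python
import json
import pandas as pd

data = pd.DataFrame({
    "title": ["District 9", "Transformers: Age of Extinction"],
    "rating": [8.0, 6.3],
})

# With no path argument, to_json returns a str
records = json.loads(data.to_json(orient="records"))
split = json.loads(data.to_json(orient="split"))
values = json.loads(data.to_json(orient="values"))

print(records)
print(split)
print(values)
```

Roughly speaking, records suits row-oriented consumers, split keeps labels and data separate, and values drops the labels entirely.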

Excel

In [91]:

# Read Excel
team = pd.read_excel('https://www.gairuo.com/file/data/dataset/team.xlsx')

team.head(5)

Out[91]:

name	team	Q1	Q2	Q3	Q4
0	Liver	E	89	21	24	64
1	Arry	C	36	37	37	57
2	Ack	A	57	60	18	84
3	Eorge	C	93	96	71	78
4	Oah	D	65	49	61	86

In [92]:

# Append a column named sum holding the total of columns Q1, Q2, Q3 and Q4
team['sum'] = team['Q1'] + team['Q2'] + team['Q3'] + team['Q4']

team.head(5)

Out[92]:

name	team	Q1	Q2	Q3	Q4	sum
0	Liver	E	89	21	24	64	198
1	Arry	C	36	37	37	57	167
2	Ack	A	57	60	18	84	219
3	Eorge	C	93	96	71	78	338
4	Oah	D	65	49	61	86	261

In [93]:

# Write Excel
team.to_excel('test.xlsx', index=False)

HDF5

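HDF5 (Hierarchical Data Format) is a storage format well suited to large volumes of numerical data.

Advantages:

HDF5 supports compression when storing data, which improves disk utilization and saves space
HDF5 is cross-platform and can easily be migrated to Hadoop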

In [94]:

score = pd.DataFrame(np.random.randint(50, 100, (100000, 10)))

score.head()

Out[94]:

0	1	2	3	4	5	6	7	8	9
0	76	53	51	94	84	77	56	82	56	50
1	93	59	84	83	77	67	52	52	53	62
2	96	98	88	72	96	64	58	67	89	95
3	57	75	89	73	72	73	58	93	72	92
4	50	50	52	57	72	76	78	52	90	93

In [95]:

# Write HDF5
score.to_hdf("./score.h5", key="score", complevel=9, mode='w')

In [96]:

# Read HDF5
new_score = pd.read_hdf("./score.h5", key="score")

new_score.head()

Out[96]:

0	1	2	3	4	5	6	7	8	9
0	76	53	51	94	84	77	56	82	56	50
1	93	59	84	83	77	67	52	52	53	62
2	96	98	88	72	96	64	58	67	89	95
3	57	75	89	73	72	73	58	93	72	92
4	50	50	52	57	72	76	78	52	90	93

Note: reading from and writing to an HDF5 file requires a key; the value stored under that key is the DataFrame.

In [97]:

score.to_csv("./score.csv", mode='w')

For comparison, here is the disk usage of the HDF5 file versus the CSV file:

-rw-r--r--  1 wind  staff   3.4M  6 13 10:24 score.csv
-rw-r--r--  1 wind  staff   763K  6 13 10:23 score.h5

Reading and writing other file formats works similarly, so no further examples are given here.

Plotting

For the use cases of the chart types below, see:

https://antv-2018.alipay.com/zh-cn/vis/chart/index.html

http://tuzhidian.com/

pandas DataFrame and Series wrap a simple plotting function built on matplotlib, which makes it quick and convenient to visualize data while processing it.

Related documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html

Line chart (line)

A line chart is used to analyze how something changes over time or across ordered categories.

As a quick first example, plot sin(x) over one period (0, 2π):

In [98]:

# Font with Chinese glyph support
plt.rc('font', family='Arial Unicode MS')
plt.rc('axes', unicode_minus=False)

x = np.arange(0, 6.29, 0.01)
y = np.sin(x)

s = pd.Series(y, index=x)
s.plot(kind='line', title='sin(x)圖像',
       style='--g', grid=True, figsize=(8, 6))

plt.show()

Note: if Chinese text does not render correctly, the following code lists the fonts available on the system.

In [99]:

from matplotlib.font_manager import FontManager
# fonts = set([x.name for x in FontManager().ttflist])
# print(fonts)

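To draw a line chart from two columns of a DataFrame, pass x='col1' and y='col2' to get a line plot with 'col1' on the x-axis and 'col2' on the y-axis. If x is not given, the index is used for the x-axis by default. For example: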

In [100]:

df = pd.DataFrame({'x': x, 'y': y})
df.plot(x='x', y='y', kind='line', ylabel='y=sin(x)',
        style='--g', grid=True, figsize=(8, 6))

plt.show()

Now a more realistic example. Below is the temperature data for Beijing, Shanghai and Hefei around noon on a given day.

In [101]:

x = [f'12點{i}分' for i in range(60)]

y_beijing = np.random.uniform(18, 23, len(x))
y_shanghai = np.random.uniform(23, 26, len(x))
y_hefei = np.random.uniform(21, 28, len(x))

df = pd.DataFrame({'x': x, 'beijing': y_beijing, 'shanghai': y_shanghai, 'hefei': y_hefei})

df.head(5)

Out[101]:

x	beijing	shanghai	hefei
0	12點0分	21.516230	24.627902	25.232748
1	12點1分	22.601501	25.982331	26.403320
2	12點2分	19.739455	23.602787	25.569943
3	12點3分	21.741182	25.102164	24.400619
4	12點4分	19.888968	23.995114	22.879671

The temperature over time in the three cities is plotted as follows:

In [102]:

df.plot(x='x', y=['beijing', 'shanghai', 'hefei'], kind='line',
        figsize=(12, 6), xlabel='時間', ylabel='溫度')
plt.show()

Adding the parameter subplots=True produces one subplot per column, as follows:

In [103]:

df.plot(x='x', y=['beijing', 'shanghai', 'hefei'], kind='line',
        subplots=True, figsize=(12, 6), xlabel='時間', ylabel='溫度')
plt.show()

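In addition, the parameter layout=(m, n) sets the number of rows and columns of the subplot grid, where m*n must be at least the number of subplots, as follows: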

In [104]:

df.plot(x='x', y=['beijing', 'shanghai', 'hefei'], kind='line',
        subplots=True, layout=(2, 2), figsize=(12, 6), xlabel='時間', ylabel='溫度')
plt.show()

Bar chart (bar)

A bar chart is best suited to comparing categorical data.

At a martial arts tournament, each hero fights 10 rounds. The wins, draws and losses of each hero are as follows:

In [105]:

columns = ['hero', 'wins', 'draws', 'losses']

score = [
    ['李尋歡', 6, 1, 3],
    ['令狐沖', 5, 4, 1],
    ['張無忌', 5, 3, 2],
    ['郭靖', 4, 5, 1],
    ['花無缺', 5, 2, 3]
]

df = pd.DataFrame(score, columns=columns)

df

Out[105]:

hero	wins	draws	losses
0	李尋歡	6	1	3
1	令狐沖	5	4	1
2	張無忌	5	3	2
3	郭靖	4	5	1
4	花無缺	5	2	3

In [106]:

df.plot(kind='bar', x='hero', y=['wins', 'draws', 'losses'],
        rot=0, title='柱狀圖', xlabel='', figsize=(10, 6))
plt.show()

Adding stacked=True draws a stacked bar chart, as follows:

In [107]:

df.plot(kind='bar', x='hero', y=['wins', 'draws', 'losses'],
        stacked=True, rot=0, title='堆疊柱狀圖', xlabel='', figsize=(10, 6))
plt.show()

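In addition, plot(kind='barh') or plot.barh() draws a horizontal bar chart. Below, the rot parameter sets the rotation angle of the tick labels, alpha sets the transparency, and align sets the bar alignment.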

In [108]:

df.plot(kind='barh', x='hero', y=['wins'], xlabel='',
        align='center', alpha=0.8, rot=0, title='水準柱狀圖', figsize=(10, 6))
plt.show()

Likewise, adding the parameter subplots=True produces multiple subplots, as follows:

In [109]:

df.plot(kind='bar', x='hero', y=['wins', 'draws', 'losses'],
        subplots=True, rot=0, xlabel='', figsize=(10, 8))
plt.show()

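Pie chart (pie)

The most notable strength of a pie chart is showing proportions.

Whenever an organization or platform publishes a programming language ranking or market share figures, many people in the industry naturally take a look.

Here is an example: the share of backend programming languages used by the R&D department of an internet company: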

In [110]:

colors = ['#FF6600', '#0099FF', '#FFFF00', '#FF0066', '#339900']
language = ['Java', 'Python', 'Golang', 'Scala', 'Others']

s = pd.Series([0.5, 0.3, 0.1, 0.06, 0.04], name='language',
              index=language)

s.plot(kind='pie', figsize=(7, 7),
       autopct="%.0f%%", colors=colors)
plt.show()

Scatter plot (scatter)

A scatter plot is suited to analyzing whether variables are related or correlated.

First, generate some random data:

In [111]:

x = np.random.uniform(0, 100, 100)
y = [2*n for n in x]

df = pd.DataFrame({'x': x, 'y': y})
df.head(5)

Out[111]:

x	y
0	10.405567	20.811134
1	24.520765	49.041530
2	32.735258	65.470516
3	90.868823	181.737646
4	21.875188	43.750377

Draw the scatter plot:

In [112]:

df.plot(kind='scatter', x='x', y='y', figsize=(12, 6))
plt.show()

The plot shows y increasing with x, which indicates a positive correlation between the two.
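The visual impression can be backed by a number: Series.corr computes the Pearson correlation coefficient by default. This is a small sketch with evenly spaced x values instead of random ones, so the result is reproducible; the names df_xy and r are illustrative.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 100, 11)
df_xy = pd.DataFrame({"x": x, "y": 2 * x})

# Pearson correlation between the two columns
r = df_xy["x"].corr(df_xy["y"])

print(r)
```

Because y is an exact multiple of x here, the coefficient is 1 up to floating-point error; real data would normally give a value strictly between -1 and 1.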

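Histogram (hist)

A histogram shows the distribution of data. The horizontal axis usually represents intervals of the data and the vertical axis the frequency; the taller a bar, the more values fall in that interval.

To build a histogram, first choose the bin width and divide the value range into bins, in other words decide how many bars there are (for example, 0 to 100 points split every 20 points gives 5 bins). Next, count the values that fall in each bin (for example, 10 people in 80-100, 20 people in 60-80, and so on). Finally, draw rectangles whose heights are determined by the counts.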
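The binning and counting steps described above can be reproduced by hand with pd.cut. This is only a sketch with ten hypothetical scores; note that pd.cut uses right-closed intervals such as (80, 100] by default.

```python
import pandas as pd

# Hypothetical exam scores
scores = pd.Series([15, 35, 42, 58, 61, 67, 79, 83, 88, 95])

# Step 1: split the range 0-100 into 5 bins of width 20
edges = [0, 20, 40, 60, 80, 100]
binned = pd.cut(scores, bins=edges)

# Step 2: count how many values fall into each bin, in bin order
counts = binned.value_counts().sort_index()

print(counts)
```

Plotting counts as bars would give the histogram; plot(kind='hist') performs the binning and counting internally.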

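Note: a histogram is not the same as a bar chart and is not meant for comparing discrete categorical data.

The earlier article on NumPy briefly covered the normal distribution and drew its histogram. Here is how to draw a histogram with pandas:

Suppose the height of adult men in a region approximately follows a normal distribution. The following generates 100000 samples from a normal distribution with mean 170 and standard deviation 5.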

In [113]:

height = np.random.normal(170, 5, 100000)

df = pd.DataFrame({'height': height})
df.head(5)

Out[113]:

height
0	166.547946
1	166.847060
2	166.887866
3	175.607073
4	181.527058

Draw the histogram with 100 bins:

In [114]:

df.plot(kind='hist', bins=100, figsize=(10, 5))

plt.grid(True, linestyle='--', alpha=0.8)
plt.show()

The plot shows at a glance that most heights cluster around 170.

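Box plot (box)

A box plot is mostly used for numerical statistics. It takes up little canvas space and uses it efficiently, which makes it well suited to comparing the distributions of several groups of data. A box plot quickly reveals key statistics such as the maximum, minimum, median, and upper and lower quartiles.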
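The statistics that a box plot draws can also be computed directly with quantile. A minimal sketch on a hypothetical Series of the integers 1 to 9; the names q1, q3 and iqr are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = s.quantile(0.25)    # lower quartile
median = s.median()      # middle line of the box
q3 = s.quantile(0.75)    # upper quartile
iqr = q3 - q1            # interquartile range, the height of the box

print(q1, median, q3, iqr)
print(s.min(), s.max())
```

The same quartiles appear in the 25%, 50% and 75% rows of describe().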

The final exam scores of 30 students in a class for five subjects (Chinese, Maths, English, Physics and Chemistry) are as follows:

In [115]:

count = 30

chinese_score = np.random.normal(80, 10, count)
maths_score = np.random.normal(85, 20, count)
english_score = np.random.normal(70, 25, count)
physics_score = np.random.normal(65, 30, count)
chemistry_score = np.random.normal(75, 5, count)

scores = pd.DataFrame({'Chinese': chinese_score, 'Maths': maths_score, 'English': english_score,
                       'Physics': physics_score, 'Chemistry': chemistry_score})

scores.head(5)

Out[115]:

Chinese	Maths	English	Physics	Chemistry
0	81.985202	77.139479	78.881483	98.688823	75.849040
1	71.597557	69.960533	98.784664	72.140422	73.179419
2	80.581071	75.501743	99.491803	18.579709	67.963091
3	82.396994	113.018430	83.224544	38.406359	75.713590
4	83.153492	103.378598	59.399535	78.381277	77.020693

Draw the box plot:

In [116]:

scores.plot(kind='box', figsize=(10, 6))

plt.grid(True, linestyle='--', alpha=1)
plt.show()

Adding the parameter vert=False displays the box plot horizontally:

In [117]:

scores.plot(kind='box', vert=False, figsize=(10, 5))

plt.grid(True, linestyle='--', alpha=1)
plt.show()

Adding the parameter subplots=True produces multiple subplots:

In [118]:

scores.plot(kind='box', y=['Chinese', 'Maths', 'English', 'Physics'],
            subplots=True, layout=(2, 2), figsize=(10, 8))

plt.show()

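Area chart (area)

An area chart is a statistical chart that shows how a value changes with an ordered variable, and it works much like a line chart. Its distinguishing feature is that the region between the line and the axis of the independent variable is filled with a color or texture.

Use cases:

Showing the trend of one or more groups of data over a continuous independent variable and comparing them with each other, while also revealing the trend of the total.

For example, displacement = velocity × time, that is s = v*t. With time t on the x-axis and the velocity v at each moment on the y-axis, an area chart shows not only how velocity changes over time but also, through the size of the area, how far the object has travelled.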
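The link between area and displacement can be checked numerically. The sketch below is a rough rectangle-rule approximation under stated assumptions: the speeds are hypothetical values in metres per second, sampled once per second, and each speed is treated as constant for the following second.

```python
import pandas as pd

# Hypothetical speeds in m/s, one sample per second
v = pd.Series([10.0, 12.0, 11.0, 13.0], index=[0, 1, 2, 3])

dt = 1.0  # seconds between samples (assumed constant)

# Area of each one-second rectangle, accumulated over time
distance = (v * dt).cumsum()

print(distance)
```

The last value of distance approximates the total area under the speed curve, which is what the filled region of an area chart conveys visually.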

On a straight section of the Mount Akina (秋名山) course, the speeds of the AE86 and the GC8 over 60 seconds are as follows:

In [119]:

t = list(range(60))

v_AE86 = np.random.uniform(180, 210, len(t))
v_GC8 = np.random.uniform(190, 230, len(t))

v = pd.DataFrame({'t': t, 'AE86': v_AE86, 'GC8': v_GC8})

v.head(5)

Out[119]:

t	AE86	GC8
0	0	198.183060	215.830409
1	1	190.186453	195.316343
2	2	180.464073	210.641824
3	3	194.842767	219.794681
4	4	194.620050	204.215492

Area charts are stacked by default.

In [120]:

v.plot(kind='area', x='t', y=['AE86', 'GC8'], figsize=(10, 5))
plt.show()

To produce an unstacked chart, pass stacked=False:

In [121]:

v.plot(kind='area', x='t', y=['AE86', 'GC8'],
       stacked=False, figsize=(10, 5), alpha=0.4)
plt.show()

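Judging by the area between each curve and the x-axis, the GC8 is clearly slightly ahead of the AE86 over these 60 seconds. A friendly reminder: drive responsibly and never race on public roads!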

Likewise, adding the parameter subplots=True produces multiple subplots, as follows:

In [122]:

v.plot(kind='area', x='t', y=['AE86', 'GC8'],
       subplots=True, figsize=(10, 5), rot=0)
plt.show()

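Summary

This is part one of the quick-start guide to pandas. It first explained why pandas is a good choice, then introduced the commonly used pandas data structures Series and DataFrame along with a range of operations and computations, then tried reading and writing files with pandas, and finally focused on how to visualize data quickly and conveniently with pandas.

Part two will cover data cleaning, merging, grouping and aggregation, pivoting, text processing and more with pandas. Stay tuned.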
