
Pandas in Detail, Part 2: The DataFrame Object

Conventions

import pandas as pd
from pandas import DataFrame
import numpy as np
           

DataFrame

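A DataFrame is a tabular data structure with both a row index (stored in index) and a column index (stored in columns).

I. Common attributes of DataFrame objects:

  • There are many ways to create a DataFrame (covered later). The most common is to pass a dictionary of equal-length lists or NumPy arrays:
dict1={"Province":["Guangdong","Beijing","Qinghai","Fujiang"],
      "year":[2018]*4,
      "pop":[1.3,2.5,1.1,0.7]}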
df1=DataFrame(dict1)
df1
           

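Output:

    Province  pop  year
0  Guangdong  1.3  2018
1    Beijing  2.5  2018
2    Qinghai  1.1  2018
3    Fujiang  0.7  2018

(This column order comes from an older pandas release that sorted dictionary keys; recent releases keep the insertion order Province, year, pop.)
  • As with Series, you can specify the column order and the index at creation time. A column that is missing from the dictionary is filled with NaN: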
df2=DataFrame(dict1,columns=['year','Province','pop','debt'],index=['one','two','three','four'])
df2
           

Output:

       year   Province  pop  debt
one    2018  Guangdong  1.3   NaN
two    2018    Beijing  2.5   NaN
three  2018    Qinghai  1.1   NaN
four   2018    Fujiang  0.7   NaN
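The effect of the columns and index arguments can be checked with a small sketch. The data and the names `data` and `frame` below are made up for illustration and are not taken from the article:

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
data = {"city": ["A", "B", "C"], "pop": [1.3, 2.5, 1.1]}

# columns= fixes the column order; "debt" is not a key of the dict,
# so pandas creates that column and fills it with NaN
frame = pd.DataFrame(data, columns=["pop", "city", "debt"], index=["x", "y", "z"])

print(list(frame.columns))         # column order follows the columns argument
print(frame["debt"].isna().all())  # every value in the new column is missing
```
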
  • As with Series, the index and columns of a DataFrame each have a name attribute:
df2
           

Output:

       year   Province  pop  debt
one    2018  Guangdong  1.3   NaN
two    2018    Beijing  2.5   NaN
three  2018    Qinghai  1.1   NaN
four   2018    Fujiang  0.7   NaN
df2.index.name='English'
df2.columns.name='Province'
df2
           

Output:

Province  year   Province  pop  debt
English
one       2018  Guangdong  1.3   NaN
two       2018    Beijing  2.5   NaN
three     2018    Qinghai  1.1   NaN
four      2018    Fujiang  0.7   NaN
  • The shape attribute gives the number of rows and columns of a DataFrame:
df2.shape
           

Output:

(4, 4)
           
  • The values attribute returns the data of the DataFrame as a two-dimensional ndarray:
df2.values
           

Output:

array([[2018, 'Guangdong', 1.3, nan],
       [2018, 'Beijing', 2.5, nan],
       [2018, 'Qinghai', 1.1, nan],
       [2018, 'Fujiang', 0.7, nan]], dtype=object)
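
The dtype=object in the result above comes from mixing strings and numbers in one array. The sketch below, using a made-up two-row frame, shows how the dtype depends on the columns selected. It also calls to_numpy(), which the pandas documentation recommends over values:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, for illustration only
frame = pd.DataFrame({
    "year": [2018, 2018],
    "Province": ["Guangdong", "Beijing"],
    "pop": [1.3, 2.5],
})

mixed = frame.values                     # strings and numbers -> dtype object
numeric = frame[["year", "pop"]].values  # int and float -> upcast to float64
same = frame[["year", "pop"]].to_numpy() # same data via to_numpy()

print(mixed.dtype, numeric.dtype)
```
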
           
  • Column labels are also exposed as attributes of the DataFrame object:
df2.Province
           

Output:

English
one      Guangdong
two        Beijing
three      Qinghai
four       Fujiang
Name: Province, dtype: object
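
Attribute access has one caveat: it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The "pop" column used in this article is such a clash, because DataFrame.pop is a method. A minimal sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only
frame = pd.DataFrame({"Province": ["Guangdong", "Beijing"], "pop": [1.3, 2.5]})

# "Province" does not clash with anything, so both spellings give the same Series
same_column = frame.Province.equals(frame["Province"])

# "pop" clashes with the DataFrame.pop method: the attribute is the method,
# and only the bracket form reaches the column
is_method = callable(frame.pop)
pop_column = frame["pop"]

print(same_column, is_method, pop_column.tolist())
```

Bracket access is therefore the safer choice in code that has to work for arbitrary column names.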
           

II. Common ways to access, assign and delete data in a DataFrame:

  • DataFrame_object[ ] selects by column label. A single label returns a Series; a list of labels returns a DataFrame:
df2['Province']
           

Output:

English
one      Guangdong
two        Beijing
three      Qinghai
four       Fujiang
Name: Province, dtype: object

df2[['Province','pop']]
           

Output:

Province   Province  pop
English
one       Guangdong  1.3
two         Beijing  2.5
three       Qinghai  1.1
four        Fujiang  0.7
  • DataFrame_object.loc[ ] selects rows by row label:
df2.loc['one']
           

Output:

Province
year             2018
Province    Guangdong
pop               1.3
debt              NaN
Name: one, dtype: object
           
df2.loc['one':'three']
           

Output:

Province  year   Province  pop  debt
English
one       2018  Guangdong  1.3   NaN
two       2018    Beijing  2.5   NaN
three     2018    Qinghai  1.1   NaN
  • A single value can be retrieved as well:
df2.loc['one','Province']
           

Output:

'Guangdong'
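
The three forms of loc shown above can be summarized in one sketch. Note that a label slice includes its end label, unlike ordinary Python slicing. The frame below is a reduced, made-up version of df2:

```python
import pandas as pd

# Reduced stand-in for df2, for illustration only
frame = pd.DataFrame(
    {"year": [2018] * 4, "pop": [1.3, 2.5, 1.1, 0.7]},
    index=["one", "two", "three", "four"],
)

rows = frame.loc["one":"three"]            # label slice: 'three' is included
value = frame.loc["two", "pop"]            # row label + column label -> scalar
sub = frame.loc[["one", "four"], ["pop"]]  # lists of labels -> DataFrame

print(list(rows.index), value, sub.shape)
```
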
           
  • A DataFrame column can be modified by assignment, using either a single value or a sequence of values:
df2["debt"]=np.arange(2,3,0.25)
df2
           

Output:

Province  year   Province  pop  debt
English
one       2018  Guangdong  1.3  2.00
two       2018    Beijing  2.5  2.25
three     2018    Qinghai  1.1  2.50
four      2018    Fujiang  0.7  2.75
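As a rough sketch of the two assignment forms: a single value is broadcast to every row, while a list or array must match the number of rows. The frame and the column name "bad" below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame, for illustration only
frame = pd.DataFrame({"pop": [1.3, 2.5, 1.1, 0.7]},
                     index=["one", "two", "three", "four"])

frame["year"] = 2018                   # a single value is repeated for every row
frame["debt"] = np.arange(2, 3, 0.25)  # four values for four rows

# A plain list or array of the wrong length is rejected
try:
    frame["bad"] = [1, 2, 3]
    length_error = False
except ValueError:
    length_error = True

print(frame)
print(length_error)
```
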
  • Assigning to a column that does not exist creates a new column; del removes a column:
df2['eastern']=df2.Province=='Guangdong'
df2
           

Output:

Province  year   Province  pop  debt  eastern
English
one       2018  Guangdong  1.3  2.00     True
two       2018    Beijing  2.5  2.25    False
three     2018    Qinghai  1.1  2.50    False
four      2018    Fujiang  0.7  2.75    False
del df2['eastern']
df2.columns
           

Output:

Index(['year', 'Province', 'pop', 'debt'], dtype='object', name='Province')
           
  • A DataFrame can, of course, also be transposed:
df2.T

Output:

English         one      two    three     four
Province
year           2018     2018     2018     2018
Province  Guangdong  Beijing  Qinghai  Fujiang
pop             1.3      2.5      1.1      0.7
debt              2     2.25      2.5     2.75

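III. Other ways to create a DataFrame

  • Calling DataFrame() converts data in many formats into a DataFrame object. Its three parameters data, index and columns supply the data, the row index and the column index respectively. data can be:

1 Two-dimensional array

df3=pd.DataFrame(np.random.randint(0,10,(4,4)),index=[1,2,3,4],columns=['A','B','C','D'])  # low=0 and high=10 are assumed bounds; values differ on every run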
df3
           

Output:

A B C D
1 9 8 4 6
2 5 7 7 4
3 6 3 2
4 4 6 9 8

2 Dictionary

The row index is set by index, and the column index comes from the dictionary keys.

dict1
           

Output:

{'Province': ['Guangdong', 'Beijing', 'Qinghai', 'Fujiang'],
 'pop': [1.3, 2.5, 1.1, 0.7],
 'year': [2018, 2018, 2018, 2018]}
           
df4=pd.DataFrame(dict1,index=[1,2,3,4])
df4
           

Output:

    Province  pop  year
1  Guangdong  1.3  2018
2    Beijing  2.5  2018
3    Qinghai  1.1  2018
4    Fujiang  0.7  2018

3 Structured array

The column index comes from the field names of the structured array.

arr=np.array([('item1',10),('item2',20),('item3',30),('item4',40)],dtype=[("name","S10"),("count",int)])
df5=pd.DataFrame(arr)
df5
           

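Output:

       name  count
0  b'item1'     10
1  b'item2'     20
2  b'item3'     30
3  b'item4'     40
  • You can also call the class methods whose names start with from_ to convert specific kinds of data into a DataFrame object. One example is from_dict(), whose orient parameter specifies which axis the dictionary keys map to; the default is "columns":
dict2={"a":[1,2,3],"b":[4,5,6]}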
df6=pd.DataFrame.from_dict(dict2)
df6
           

Output:

   a  b
0  1  4
1  2  5
2  3  6
df7=pd.DataFrame.from_dict(dict2,orient="index")
df7
           

Output:

   0  1  2
a  1  2  3
b  4  5  6
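
With orient="index" the columns get the default labels 0, 1, 2. A small sketch of naming them at creation time through the columns parameter of from_dict(), which is only accepted together with orient="index". The labels "x", "y", "z" are made up for illustration:

```python
import pandas as pd

data = {"a": [1, 2, 3], "b": [4, 5, 6]}

# Default orient="columns": the dict keys become column labels
by_columns = pd.DataFrame.from_dict(data)

# orient="index": the dict keys become row labels; the columns parameter
# (hypothetical labels here) names the resulting columns
by_rows = pd.DataFrame.from_dict(data, orient="index", columns=["x", "y", "z"])

print(by_columns.shape, by_rows.shape)
print(by_rows)
```
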

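IV. Converting a DataFrame object to other data formats

  • The to_dict() method converts a DataFrame object to a dictionary; the orient parameter determines the type of the dictionary's elements: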
df7.to_dict()
           

Output:

{0: {'a': 1, 'b': 4}, 1: {'a': 2, 'b': 5}, 2: {'a': 3, 'b': 6}}
           
df7.to_dict(orient="records")
           

Output:

[{0: 1, 1: 2, 2: 3}, {0: 4, 1: 5, 2: 6}]
           
df7.to_dict(orient="list")
           

Output:

{0: [1, 4], 1: [2, 5], 2: [3, 6]}
           
  • Similar methods include to_records(), to_csv() and others.
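
A minimal sketch of these two methods, using a made-up two-row frame. Passing index=False to drop the row index is a choice made for this illustration, not something the article prescribes:

```python
import pandas as pd

# Hypothetical two-row frame, for illustration only
frame = pd.DataFrame({"a": [1, 2], "b": [4, 5]})

# to_records() returns a NumPy record array whose field names are the column labels
records = frame.to_records(index=False)

# to_csv() returns the CSV text as a string when no file path is given
csv_text = frame.to_csv(index=False)

print(records.dtype.names)
print(csv_text)
```
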

Thank you for reading.

I hope my work has been helpful to you.

Let's keep learning together!