10 Minutes to pandas (translation)

This is a short introduction to pandas, geared mainly toward new users. You can find more complex recipes in the Cookbook.

Customarily, we import the following modules:

In []: import pandas as pd

In []: import numpy as np

In []: import matplotlib.pyplot as plt
           

Object Creation

Creating a Series by passing a list of values:

In []: s = pd.Series([1,3,5,np.nan,6,8])

In []: s
Out[]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
           

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:

In []: dates = pd.date_range('20130101', periods=6)

In []: dates
Out[]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In []: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In []: df
Out[]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
           

Creating a DataFrame by passing a dict of objects:

In []: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                     'B' : pd.Timestamp('20130102'),
   ....:                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:                     'D' : np.array([3] * 4,dtype='int32'),
   ....:                     'E' : pd.Categorical(["test","train","test","train"]),
   ....:                     'F' : 'foo' })
   ....: 

In []: df2
Out[]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
           

Viewing the dtypes of the columns:

In []: df2.dtypes
Out[]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
           

Viewing Data

Viewing the top and bottom rows of the frame:

In []: df.head()
Out[]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

In []: df.tail(3)
Out[]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
           

Displaying the index, the columns, and the underlying numpy data:

In []: df.index
Out[]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In []: df.columns
Out[]: Index([u'A', u'B', u'C', u'D'], dtype='object')

In []: df.values
Out[]: 
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
           

describe() shows a quick statistic summary of your data:

In []: df.describe()
Out[]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804
           

Transposing your data:

In []: df.T
Out[]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
           

Sorting by an axis:

In []: df.sort_index(axis=1, ascending=False)
Out[]: 
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690
           

Sorting by the values in a specified column:

In []: df.sort_values(by='B')
Out[]: 
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
           

Selection

Note: while standard Python/numpy expressions for selecting and setting data are intuitive and handy, for production code we recommend the optimized pandas data access methods, .at, .iat, .loc, .iloc and .ix.
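As a sketch of the difference (using a small hypothetical frame, not the tutorial's random data), chained `[]` indexing and the optimized `.at` accessor return the same scalar, but `.at` does it in a single optimized lookup:

```python
import pandas as pd

# Small hypothetical frame for illustration
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]},
                  index=pd.date_range("20130101", periods=2))

first = df.index[0]
chained = df["A"][first]       # standard Python-style chained indexing
optimized = df.at[first, "A"]  # optimized pandas scalar accessor
```

Both expressions yield the same value; the accessor form is also safer when setting values, since chained indexing may operate on a copy.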

Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In []: df['A']
Out[]: 
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64
           

Selecting rows via []:

In []: df[0:3]
Out[]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In []: df['20130102':'20130104']
Out[]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
           

Selection by Label

Getting a cross section using a date label:

In [26]: df.loc[dates[0]]
Out[26]: 
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
           

Selecting on a multi-axis by label:

In []: df.loc[:,['A','B']]
Out[]: 
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648
           

Label slicing (both endpoints are included):

In []: df.loc['20130102':'20130104',['A','B']]
Out[]: 
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
           

Reduction in the dimensions of the returned object:

In [29]: df.loc['20130102',['A','B']]
Out[29]: 
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
           

Getting a scalar value:

In []: df.loc[dates[0],'A']
Out[]: 0.46911229990718628

A faster method for getting a scalar (equivalent to the prior method):

In []: df.at[dates[0],'A']
Out[]: 0.46911229990718628
           

Selection by Position

Selecting via the position of the passed integer:

In [32]: df.iloc[3]
Out[32]: 
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64
           

By integer slices, acting similarly to numpy/python:

In [33]: df.iloc[3:5,0:2]
Out[33]: 
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
           

By lists of integer positions:

In []: df.iloc[[1,2,4],[0,2]]
Out[]: 
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232
           

For slicing rows explicitly:

In []: df.iloc[1:3,:]
Out[]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
           

For slicing columns explicitly:

In [36]: df.iloc[:,1:3]
Out[36]: 
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427
           

For getting a value explicitly:

In [37]: df.iloc[1,1]
Out[37]: -0.17321464905330858
           

For fast access to a scalar, the equivalent iat method:

In [38]: df.iat[1,1]
Out[38]: -0.17321464905330858
           

Boolean Indexing

Using a single column's values to select data:

In []: df[df.A > 0]
Out[]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
           

A where operation for getting values:

In []: df[df > 0]
Out[]: 
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
           

Using the isin() method for filtering:

In []: df2 = df.copy()

In []: df2['E'] = ['one', 'one','two','three','four','three']

In []: df2
Out[]: 
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In []: df2[df2['E'].isin(['two','four'])]
Out[]: 
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
           

Setting

Setting a new column automatically aligns the data by the indexes:

In []: s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))

In []: s1
Out[]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In []: df['F'] = s1
           

Setting values by label:

In []: df.at[dates[0],'A'] = 0
           

Setting values by position:

In []: df.iat[0,1] = 0
           

Setting by assigning with a numpy array:

In []: df.loc[:,'D'] = np.array([5] * len(df))
           

The result of the prior setting operations:

In []: df
Out[]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0
           

A where operation with setting:

In []: df2 = df.copy()

In []: df2[df2 > 0] = -df2

In []: df2
Out[]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
           

Missing Data

pandas primarily uses np.nan to represent missing data. It is by default not included in computations.
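For instance (a minimal sketch, not part of the original tutorial), reductions simply skip NaN unless told otherwise:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
m = s.mean()                     # NaN is skipped: (1 + 3) / 2 = 2.0
m_strict = s.mean(skipna=False)  # NaN propagates into the result instead
```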

Reindexing allows you to change/add/delete the index on a specified axis, and returns a copy of the data:

In []: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In []: df1.loc[dates[0]:dates[1],'E'] = 1

In []: df1
Out[]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN
           

Dropping any rows that have missing data:

In []: df1.dropna(how='any')
Out[]: 
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
           

Filling missing data:

In []: df1.fillna(value=5)
Out[]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0
           

Getting the boolean mask of where values are missing:

In []: pd.isnull(df1)
Out[]: 
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
           

Operations

Operations in general exclude missing data.

Performing a descriptive statistic:

In []: df.mean()
Out[]: 
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64
           

The same operation on the other axis:

In []: df.mean(1)
Out[]: 
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64
           

Operating with objects that have different dimensionality requires alignment. In addition, pandas automatically broadcasts along the specified dimension:

In []: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)

In []: s
Out[]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In []: df.sub(s, axis='index')
Out[]: 
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
           

Apply

Applying functions to the data:

In []: df.apply(np.cumsum)
Out[]: 
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

In []: df.apply(lambda x: x.max() - x.min())
Out[]: 
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64
           

Histogramming

In []: s = pd.Series(np.random.randint(0, 7, size=10))

In []: s
Out[]: 
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64

In []: s.value_counts()
Out[]: 
4    5
6    2
2    2
1    1
dtype: int64
           

String Methods

Series is equipped with a set of string processing methods in the str attribute, analogous to Python's str methods, that operate on each element of the array. Note that pattern matching in str generally uses regular expressions by default.

In []: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In []: s.str.lower()
Out[]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
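Because str methods treat patterns as regular expressions by default, a pattern can be passed directly; a minimal sketch (the `na=False` argument, an addition here, excludes the missing value from the mask):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
# "^[aA]" is a regular expression: strings starting with "a" or "A"
mask = s.str.contains('^[aA]', na=False)
matches = list(s[mask])
```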
           

Merge

Concat

Concatenating pandas objects together with concat():

In []: df = pd.DataFrame(np.random.randn(10, 4))

In []: df
Out[]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
In []: pieces = [df[:3], df[3:7], df[7:]]

In []: pd.concat(pieces)
Out[]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
           

Join

SQL-style merges with merge():

In []: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In []: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In []: left
Out[]: 
   key  lval
0  foo     1
1  foo     2

In []: right
Out[]: 
   key  rval
0  foo     4
1  foo     5

In []: pd.merge(left, right, on='key')
Out[]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
           

Append

Appending rows to a DataFrame:

In []: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

In []: df
Out[]: 
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758

In []: s = df.iloc[3]

In []: df.append(s, ignore_index=True)
Out[]: 
          A         B         C         D
0  1.346061  1.511763  1.627081 -0.990582
1 -0.441652  1.211526  0.268520  0.024580
2 -1.577585  0.396823 -0.105381 -0.532532
3  1.453749  1.208843 -0.080952 -0.264610
4 -0.727965 -0.589346  0.339969 -0.693205
5 -0.339355  0.593616  0.884345  1.591431
6  0.141809  0.220390  0.435589  0.192451
7 -0.096701  0.803351  1.715071 -0.708758
8  1.453749  1.208843 -0.080952 -0.264610
           

Grouping

By "group by" we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
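The three steps can be seen end to end on a tiny deterministic frame (hypothetical values, chosen so the result is checkable by hand), before the tutorial's random example below:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
# split on column A, apply sum() to each group, combine into a new frame
sums = df.groupby('A').sum()
# bar: 2.0 + 4.0 = 6.0;  foo: 1.0 + 3.0 = 4.0
```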
In []: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                   'B' : ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                   'C' : np.random.randn(8),
   ....:                   'D' : np.random.randn(8)})
   ....: 

In []: df
Out[]: 
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033
           

Grouping and then applying the sum function to the resulting groups:

In []: df.groupby('A').sum()
Out[]: 
            C        D
A                     
bar -2.802588  2.42611
foo  3.146492 -0.63958
           

Grouping by multiple columns forms a hierarchical index, to which we then apply the function:

In []: df.groupby(['A','B']).sum()
Out[]: 
                  C         D
A   B                        
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
           

Reshaping

Stack

In []: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                     'foo', 'foo', 'qux', 'qux'],
   ....:                    ['one', 'two', 'one', 'two',
   ....:                     'one', 'two', 'one', 'two']]))
   ....: 

In []: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In []: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

In []: df2 = df[:4]

In []: df2
Out[]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230
           

The stack() method "compresses" the column labels into a new level of the index:

In []: stacked = df2.stack()

In []: stacked
Out[]: 
first  second   
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64
           

The inverse operation of stack() is unstack(), which by default unstacks the last level:

In []: stacked.unstack()
Out[]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In []: stacked.unstack(1)
Out[]: 
second        one       two
first                      
bar   A  0.029399  0.282696
      B -0.542108 -0.087302
baz   A -1.575170  0.816482
      B  1.771208  1.100230

In []: stacked.unstack(0)
Out[]: 
first          bar       baz
second                      
one    A  0.029399 -1.575170
       B -0.542108  1.771208
two    A  0.282696  0.816482
       B -0.087302  1.100230
           

Pivot Tables

In []: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
   .....:                  'B' : ['A', 'B', 'C'] * 4,
   .....:                  'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                  'D' : np.random.randn(12),
   .....:                  'E' : np.random.randn(12)})
   .....: 

In []: df
Out[]: 
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115

We can produce pivot tables from this data very easily with pivot_table:

In []: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[]: 
C             bar       foo
A     B                    
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826
           

Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data).

In []: rng = pd.date_range('1/1/2012', periods=100, freq='S')

In []: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In []: ts.resample('5Min').sum()
Out[]: 
2012-01-01    25083
Freq: 5T, dtype: int64
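Since the tutorial's series above is random, here is a deterministic resampling sketch with assumed toy values, so the binning is checkable by hand:

```python
import pandas as pd

rng = pd.date_range('1/1/2012', periods=4, freq='s')
ts = pd.Series([10, 20, 30, 40], index=rng)
# all four secondly observations fall into a single 5-minute bin
five_min = ts.resample('5min').sum()
```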
           

Time zone representation:

In []: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')

In []: ts = pd.Series(np.random.randn(len(rng)), rng)

In []: ts
Out[]: 
2012-03-06    0.464000
2012-03-07    0.227371
2012-03-08   -0.496922
2012-03-09    0.306389
2012-03-10   -2.290613
Freq: D, dtype: float64

In []: ts_utc = ts.tz_localize('UTC')

In []: ts_utc
Out[]: 
2012-03-06 00:00:00+00:00    0.464000
2012-03-07 00:00:00+00:00    0.227371
2012-03-08 00:00:00+00:00   -0.496922
2012-03-09 00:00:00+00:00    0.306389
2012-03-10 00:00:00+00:00   -2.290613
Freq: D, dtype: float64
           

Converting to another time zone:

In []: ts_utc.tz_convert('US/Eastern')
Out[]: 
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64
           

Converting between time span representations:

In []: rng = pd.date_range('1/1/2012', periods=5, freq='M')

In []: ts = pd.Series(np.random.randn(len(rng)), index=rng)

In []: ts
Out[]: 
2012-01-31   -1.134623
2012-02-29   -1.561819
2012-03-31   -0.260838
2012-04-30    0.281957
2012-05-31    1.523962
Freq: M, dtype: float64

In []: ps = ts.to_period()

In []: ps
Out[]: 
2012-01   -1.134623
2012-02   -1.561819
2012-03   -0.260838
2012-04    0.281957
2012-05    1.523962
Freq: M, dtype: float64

In []: ps.to_timestamp()
Out[]: 
2012-01-01   -1.134623
2012-02-01   -1.561819
2012-03-01   -0.260838
2012-04-01    0.281957
2012-05-01    1.523962
Freq: MS, dtype: float64
           

Converting between period and timestamp makes some convenient arithmetic functions easy to apply. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In []: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

In []: ts = pd.Series(np.random.randn(len(prng)), prng)

In []: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

In []: ts.head()
Out[]: 
1990-03-01 09:00   -0.902937
1990-06-01 09:00    0.068159
1990-09-01 09:00   -0.057873
1990-12-01 09:00   -0.368204
1991-03-01 09:00   -1.144073
Freq: H, dtype: float64
           

Categoricals

Since version 0.15, a DataFrame can contain categorical data.

Converting the raw grades to a categorical data type:

In []: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

In []: df["grade"] = df["raw_grade"].astype("category")

In []: df["grade"]
Out[]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
           

Renaming the categories to more meaningful names:

In []: df["grade"].cat.categories = ["very good", "good", "very bad"]
           

Reordering the categories and simultaneously adding the missing categories:

In []: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

In []: df["grade"]
Out[]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
           

Sorting is per order in the categories, not lexical order:

In []: df.sort_values(by="grade")
Out[]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
           

Grouping by a categorical column also shows empty categories:

In []: df.groupby("grade").size()
Out[]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
           

Plotting

In []: ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

In []: ts = ts.cumsum()

In []: ts.plot()
Out[]: <matplotlib.axes._subplots.AxesSubplot at 0x...>

On a DataFrame, plot() is a convenience to plot all of the columns with labels:

In []: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                  columns=['A', 'B', 'C', 'D'])
   .....: 

In []: df = df.cumsum()

In []: plt.figure(); df.plot(); plt.legend(loc='best')
Out[]: <matplotlib.legend.Legend at 0x...>

Getting Data In/Out

CSV

Writing to a csv file:

In []: df.to_csv('foo.csv')

Reading from a csv file:

In []: pd.read_csv('foo.csv')
Out[]: 
     Unnamed: 0          A          B          C          D
0    2000-01-01   0.266457  -0.399641  -0.219582   1.186860
1    2000-01-02  -1.170732  -0.345873   1.653061  -0.282953
2    2000-01-03  -1.734933   0.530468   2.311896  -0.509059
..          ...        ...        ...        ...        ...

[1000 rows x 5 columns]
           

HDF5

Writing to an HDF5 store:

In []: df.to_hdf('foo.h5','df')

Reading from an HDF5 store:

In []: pd.read_hdf('foo.h5','df')
Out[]: 
                    A          B          C          D
2000-01-01   0.266457  -0.399641  -0.219582   1.186860
2000-01-02  -1.170732  -0.345873   1.653061  -0.282953
2000-01-03  -1.734933   0.530468   2.311896  -0.509059
...               ...        ...        ...        ...

[1000 rows x 4 columns]
           

Excel

Writing to an excel file:

In []: df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an excel file:

In []: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[]: 
                    A          B          C          D
2000-01-01   0.266457  -0.399641  -0.219582   1.186860
2000-01-02  -1.170732  -0.345873   1.653061  -0.282953
2000-01-03  -1.734933   0.530468   2.311896  -0.509059
...               ...        ...        ...        ...

[1000 rows x 4 columns]
           

Gotchas

If you are trying an operation and you see an exception like:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
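The fix is to be explicit about what truthiness should mean for the whole Series; a short sketch of the idioms the error message suggests:

```python
import pandas as pd

s = pd.Series([False, True, False])

any_true = s.any()   # True if at least one element is True
all_true = s.all()   # True only if every element is True
is_empty = s.empty   # True if the Series has no elements
```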