Handling Time Series Data with pandas

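Quick overview

- Creating time series
  - Four types of time variables
  - Date times (points in time)
  - Date offsets (relative time differences)
- Indexing and attributes of time series
- Resampling
- Window functions: rolling/expanding
- Exercises
- Reference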

Creating time series

Four types of time variables

Date times (points in time)

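pandas is very permissive about the input format used to create a point in time. Each of the following statements creates the same time point correctly: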

import numpy as np
import pandas as pd

pd.to_datetime('2020.1.1')
pd.to_datetime('2020 1.1')
pd.to_datetime('2020 1 1')
pd.to_datetime('2020 1-1')
pd.to_datetime('2020-1 1')
pd.to_datetime('2020-1-1')
pd.to_datetime('2020/1/1')
pd.to_datetime('1.1.2020')
pd.to_datetime('1.1 2020')
pd.to_datetime('1 1 2020')
pd.to_datetime('1 1-2020')
pd.to_datetime('1-1 2020')
pd.to_datetime('1-1-2020')
pd.to_datetime('1/1/2020')
pd.to_datetime('20200101')
pd.to_datetime('2020.0101')
# each of the following statements raises an error
#pd.to_datetime('2020\\1\\1')
#pd.to_datetime('2020`1`1')
#pd.to_datetime('2020.1 1')
#pd.to_datetime('1 1.2020')

When a statement raises an error, the format parameter can be used to force a match:

pd.to_datetime('2020\\1\\1',format='%Y\\%m\\%d')
pd.to_datetime('2020`1`1',format='%Y`%m`%d')
pd.to_datetime('2020.1 1',format='%Y.%m %d')
pd.to_datetime('1 1.2020',format='%d %m.%Y')
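
As a small sketch of the format parameter applied to several strings at once (the sample strings are made up for illustration), errors='coerce' turns values that do not match the format into NaT instead of raising:

```python
import pandas as pd

# Hypothetical raw strings; the last one does not match the format.
raw = ['2020.1 1', '2020.1 2', 'not a date']

# An explicit format removes ambiguity; errors='coerce' yields NaT on failure.
parsed = pd.to_datetime(raw, format='%Y.%m %d', errors='coerce')
print(parsed)

# A day-first layout parsed with a matching format string.
single = pd.to_datetime('1 1.2020', format='%d %m.%Y')
print(single)
```

Whether to coerce or to raise is a design choice: coercing keeps a batch job running, while raising surfaces bad input early.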

Passing a list converts it into a DatetimeIndex:

print(pd.Series(range(2),index=pd.to_datetime(['2020/1/1','2020/1/2'])))
print(type(pd.to_datetime(['2020/1/1','2020/1/2'])))

2020-01-01    0
2020-01-02    1
dtype: int64
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

For a DataFrame whose columns hold the date components (named year, month, day and so on), to_datetime assembles the dates automatically:

df = pd.DataFrame({'year': [2020, 2020],'month': [1, 1], 'day': [1, 2]})
pd.to_datetime(df)

0   2020-01-01
1   2020-01-02
dtype: datetime64[ns]
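
A minimal sketch with made-up values, suggesting that the column names rather than the column order drive the assembly, and that finer components such as hour can be included:

```python
import pandas as pd

# Column order does not matter; the names identify each component.
parts = pd.DataFrame({'day': [1, 2], 'hour': [8, 20],
                      'year': [2020, 2020], 'month': [1, 1]})
stamps = pd.to_datetime(parts)
print(stamps)
```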

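The precision of a Timestamp goes far beyond the day: it can be as fine as a nanosecond (ns). The price of that precision is range: only about 584 years of time points are representable.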

print(pd.to_datetime('2020/1/1 00:00:00.123456789'))
print(pd.Timestamp.min)
print(pd.Timestamp.max)

2020-01-01 00:00:00.123456789
1677-09-21 00:12:43.145225
2262-04-11 23:47:16.854775807
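
The figure of roughly 584 years can be checked with a little arithmetic. This sketch uses the integer .value attribute (nanoseconds since the epoch) rather than subtracting the two Timestamps, on the assumption that the difference is too large to fit in a Timedelta:

```python
import pandas as pd

# .value is the number of nanoseconds since the Unix epoch, as a plain int.
span_ns = pd.Timestamp.max.value - pd.Timestamp.min.value

# Convert nanoseconds -> seconds -> days -> (Julian) years.
span_years = span_ns / 1e9 / 86400 / 365.25
print(round(span_years, 1))
```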

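In the date_range method, start/end/periods (number of time points)/freq (spacing) are the most important parameters; once three of them are given, the fourth is determined. The freq parameter has many options: D/B for calendar day/business day; W for week; M/Q/Y for month/quarter/year end; BM/BQ/BY for the last business day of the month/quarter/year; MS/QS/YS for month/quarter/year start; BMS/BQS/BYS for the first business day of the month/quarter/year; H for hour; T for minute; S for second. The full list is in the offset aliases section of the pandas documentation. Note that pandas 2.2 and later rename several of these aliases (for example ME/QE/YE for the period ends, h for hour, min for minute, s for second) and deprecate the old spellings.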

print(pd.date_range(start='2020/1/1',end='2020/1/10',periods=3))
print(pd.date_range(start='2020/1/1',end='2020/1/10',freq='D'))
print(pd.date_range(start='2020/1/1',periods=3,freq='D'))
print(pd.date_range(end='2020/1/3',periods=3,freq='D'))
print(pd.date_range(start='2020/1/1',periods=3,freq='T'))
print(pd.date_range(start='2020/1/1',periods=3,freq='M'))
print(pd.date_range(start='2020/1/1',periods=3,freq='BYS'))

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-05 12:00:00',
               '2020-01-10 00:00:00'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T')
DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31'], dtype='datetime64[ns]', freq='M')
DatetimeIndex(['2020-01-01', '2021-01-01', '2022-01-03'], dtype='datetime64[ns]', freq='BAS-JAN')
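
The "three out of four" rule can be sketched as follows. The dates are arbitrary, and the aliases used here ('MS' and 'D') are ones that, as far as I know, are spelled the same in old and new pandas versions:

```python
import pandas as pd

# start + end + periods: 4 evenly spaced points, the spacing is implied.
even = pd.date_range(start='2020-01-01', end='2020-01-10', periods=4)

# start + periods + freq: the end is implied ('MS' = month start).
month_starts = pd.date_range(start='2020-01-01', periods=3, freq='MS')

# end + periods + freq: the start is implied.
back = pd.date_range(end='2020-01-10', periods=3, freq='D')

print(even, month_starts, back, sep='\n')
```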

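bdate_range is similar to date_range; its distinguishing feature is that, on top of the built-in business-day frequency, it also accepts the weekmask and holidays parameters. Its freq has the special options 'C'/'CBM'/'CBMS', meaning custom, which must be used together with weekmask and holidays. For example, suppose we want to keep only Monday, Tuesday and Friday among the business days and exclude some holidays: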

weekmask = 'Mon Tue Fri'
holidays = [pd.Timestamp('2020/1/%s'%i) for i in range(7,13)]
# note the holidays
pd.bdate_range(start='2020-1-1',end='2020-1-15',freq='C',weekmask=weekmask,holidays=holidays)

DatetimeIndex(['2020-01-03', '2020-01-06', '2020-01-13', '2020-01-14'], dtype='datetime64[ns]', freq='C')
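
A second, smaller sketch of the same idea, with a different (made-up) weekmask and a single hypothetical holiday, so the effect of each parameter is easy to trace by hand:

```python
import pandas as pd

# Keep only Mondays and Wednesdays, and drop one made-up holiday (a Wednesday).
days = pd.bdate_range(start='2020-01-01', end='2020-01-15', freq='C',
                      weekmask='Mon Wed',
                      holidays=[pd.Timestamp('2020-01-08')])
print(days)
```

Here 2020-01-08 would normally qualify as a Wednesday, but it is removed because it appears in holidays.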

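Date offsets (relative time differences)

The difference between DateOffset and Timedelta is this: a Timedelta is an absolute time difference, so adding or subtracting 1 day always means exactly 24 hours, whether in winter time or summer time. A DateOffset is a relative time difference, so adding or subtracting 1 day lands on the same wall-clock time on the adjacent day, whether that day has 23, 24 or 25 hours.

For example, in the UK on 29 March 2020 at 01:00:00 local time, the clocks went forward one hour to 02:00:00, starting daylight saving time.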
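
The daylight saving case can be sketched like this, assuming the 'Europe/London' time zone is available on the system:

```python
import pandas as pd

# Midnight on the day UK clocks go forward; this day has only 23 hours.
ts = pd.Timestamp('2020-03-29 00:00:00', tz='Europe/London')

absolute = ts + pd.Timedelta(days=1)    # exactly 24 elapsed hours
relative = ts + pd.DateOffset(days=1)   # same wall-clock time on the next day

print(absolute)
print(relative)
```

The Timedelta result lands at 01:00 on 30 March because only 23 wall-clock hours separate the two midnights, while the DateOffset result stays at midnight.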


The optional parameters of DateOffset include years/months/weeks/days/hours/minutes/seconds.
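
As an illustration of these keyword arguments, offsets can be added and subtracted in one expression. The expression below is my own choice and not necessarily the original example:

```python
import pandas as pd

# Move 20 minutes forward, then two weeks back.
moved = pd.Timestamp('2020-01-01') + pd.DateOffset(minutes=20) - pd.DateOffset(weeks=2)
print(moved)
```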

2019-12-18 00:20:00


Offset operations on sequences

# offset.apply(ts) was deprecated and later removed from pandas; ts + offset is equivalent
print(pd.Series([i + pd.offsets.BYearBegin(3) for i in pd.date_range('20200101',periods=3,freq='Y')]))
print(pd.date_range('20200101',periods=3,freq='Y') + pd.offsets.BYearBegin(3))
print(pd.Series([i + pd.offsets.CDay(3,weekmask='Wed Fri',holidays='2020010')
                 for i in pd.date_range('20200105',periods=3,freq='D')]))
#pd.date_range('20200105',periods=3,freq='D')
#DatetimeIndex(['2020-01-05', '2020-01-06', '2020-01-07'], dtype='datetime64[ns]', freq='D')

0   2023-01-02
1   2024-01-01
2   2025-01-01
dtype: datetime64[ns]
DatetimeIndex(['2023-01-02', '2024-01-01', '2025-01-01'], dtype='datetime64[ns]', freq='A-DEC')
0   2020-01-15
1   2020-01-15
2   2020-01-15
dtype: datetime64[ns]
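
A shorter sketch of adding an offset to a whole DatetimeIndex at once, using two offsets (BDay and MonthBegin) chosen only for illustration:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=3, freq='D')  # Wed, Thu, Fri

# Next business day: the Friday rolls over the weekend to Monday.
shifted = idx + pd.offsets.BDay(1)

# Next month start: every date in January moves to 1 February.
month_begin = idx + pd.offsets.MonthBegin(1)

print(shifted)
print(month_begin)
```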

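Indexing and attributes of time series

Indexing and slicing follow almost exactly the same rules as ordinary pandas indexing. Valid strings are converted to time points automatically, and mixed forms are supported.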

rng = pd.date_range('2020','2021', freq='W')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts.head())
print(ts['2020-01-26'])
print(ts['2020-01-26':'20200306'])
print(ts['2020-7'])
print(ts['2011-1':'20200726'].head())

2020-01-05    1.101587
2020-01-12    0.344175
2020-01-19    0.521394
2020-01-26    0.535159
2020-02-02   -0.536123
Freq: W-SUN, dtype: float64
0.5351588314930403
2020-01-26    0.535159
2020-02-02   -0.536123
2020-02-09    0.109903
2020-02-16   -0.102390
2020-02-23   -0.524725
2020-03-01   -0.756281
Freq: W-SUN, dtype: float64
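
Because the series above is random, here is a deterministic sketch of the same indexing patterns. The values are just positions 0, 1, 2, ..., and .loc is used as a conservative spelling of label-based access:

```python
import pandas as pd

rng = pd.date_range('2020-01-01', '2020-12-31', freq='W')  # the Sundays of 2020
ts = pd.Series(range(len(rng)), index=rng)

one_day = ts.loc['2020-01-26']               # exact date -> scalar
july = ts.loc['2020-07']                     # partial string -> the whole month
window = ts.loc['2020-01-26':'2020-03-06']   # slice, both ends inclusive

print(one_day, len(july), len(window))
```

The slice end '2020-03-06' is not itself in the index; the slice simply stops at the last label on or before it.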

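The dt accessor makes it easy to extract time information from a Series; for a DatetimeIndex the same information is available directly as attributes, and strftime reformats the times as strings.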

#print(pd.Series(ts.index).dt.week)
#print(pd.Series(ts.index).dt.day)
print(pd.Series(ts.index).dt.strftime('%Y-sep1-%m-sep2-%d').head())
print(pd.Series(ts.index).dt.strftime('%Y年%m月%d日').head())
print(pd.date_range('2020','2021', freq='W').month)

0    2020-sep1-01-sep2-05
1    2020-sep1-01-sep2-12
2    2020-sep1-01-sep2-19
3    2020-sep1-01-sep2-26
4    2020-sep1-02-sep2-02
dtype: object
0    2020年01月05日
1    2020年01月12日
2    2020年01月19日
3    2020年01月26日
4    2020年02月02日
dtype: object
Int64Index([ 1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,  3,  4,  4,  4,  4,
             5,  5,  5,  5,  5,  6,  6,  6,  6,  7,  7,  7,  7,  8,  8,  8,  8,
             8,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12,
            12],
           dtype='int64')
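
A small sketch of a few dt attributes on three known Sundays. The ISO week lookup assumes a pandas version that provides dt.isocalendar() (1.1 or later, to my knowledge), which stands in for the commented-out dt.week above:

```python
import pandas as pd

s = pd.Series(pd.date_range('2020-01-05', periods=3, freq='W'))  # three Sundays

weekdays = s.dt.dayofweek.tolist()                    # Monday=0 ... Sunday=6
days = s.dt.day.tolist()                              # day of month
weeks = s.dt.isocalendar().week.astype(int).tolist()  # ISO week numbers
labels = s.dt.strftime('%Y/%m/%d').tolist()           # formatted strings

print(weekdays, days, weeks, labels)
```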

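Resampling

Resampling refers to the resample function, which can be seen as the time-series version of groupby. The sampling frequency is usually given as one of the offset strings mentioned above: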

print(pd.date_range('1/1/2020', freq='S', periods=1000))
df_r = pd.DataFrame(np.random.randn(1000, 3),index=pd.date_range('1/1/2020', freq='S', periods=1000),
                  columns=['A', 'B', 'C'])
r = df_r.resample('3min')
print(r.sum())

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:00:01',
               '2020-01-01 00:00:02', '2020-01-01 00:00:03',
               '2020-01-01 00:00:04', '2020-01-01 00:00:05',
               '2020-01-01 00:00:06', '2020-01-01 00:00:07',
               '2020-01-01 00:00:08', '2020-01-01 00:00:09',
               ...
               '2020-01-01 00:16:30', '2020-01-01 00:16:31',
               '2020-01-01 00:16:32', '2020-01-01 00:16:33',
               '2020-01-01 00:16:34', '2020-01-01 00:16:35',
               '2020-01-01 00:16:36', '2020-01-01 00:16:37',
               '2020-01-01 00:16:38', '2020-01-01 00:16:39'],
              dtype='datetime64[ns]', length=1000, freq='S')
                            A          B          C
2020-01-01 00:00:00 -6.214172  15.056536  -2.040001
2020-01-01 00:03:00 -0.974375  -5.857030 -10.369295
2020-01-01 00:06:00  1.836822  17.165221   9.111447
2020-01-01 00:09:00  2.030140   4.314473  14.528695
2020-01-01 00:12:00  7.339233   5.753052 -24.641334
2020-01-01 00:15:00 -8.736690  -0.122362  -2.023157
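
To make the groupby analogy concrete, here is a deterministic sketch (a series of ones, chosen so the sums are just bin sizes) that bins the same data both ways:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=10, freq='min')
ones = pd.Series(1, index=idx)

# Ten one-minute points fall into 3-minute bins of sizes 3, 3, 3 and 1.
by_resample = ones.resample('3min').sum()

# The same grouping expressed with groupby: floor each timestamp to its bin.
by_groupby = ones.groupby(idx.floor('3min')).sum()

print(by_resample)
```

One difference worth keeping in mind: resample also emits empty bins inside the range, whereas the groupby version only produces groups that contain data.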

df_r2 = pd.DataFrame(np.random.randn(200, 3),index=pd.date_range('1/1/2020', freq='D', periods=200),
                  columns=['A', 'B', 'C'])
r = df_r2.resample('CBMS')
print(r.sum())

                   A          B         C
2020-01-01  1.518244  -0.743317 -3.515077
2020-02-03  1.378320   4.415827 -1.629024
2020-03-02 -0.705835  10.281621 -5.257010
2020-04-01  1.783766  -3.383655  2.103400
2020-05-01  4.551639   0.141568  5.081334
2020-06-01  2.434142  -1.549992 -0.175485
2020-07-01  0.569179  -2.901138 -4.751556

Aggregation after resampling

r = df_r.resample('3T')
print(r['A'].mean())
print(r['A'].agg([np.sum, np.mean, np.std]))
# similarly, a function or lambda expression can be used
print(r.agg({'A': np.sum,'B': lambda x: max(x)-min(x)}))

2020-01-01 00:00:00   -0.034523
2020-01-01 00:03:00   -0.005413
2020-01-01 00:06:00    0.010205
2020-01-01 00:09:00    0.011279
2020-01-01 00:12:00    0.040774
2020-01-01 00:15:00   -0.087367
Freq: 3T, Name: A, dtype: float64
                          sum      mean       std
2020-01-01 00:00:00 -6.214172 -0.034523  1.083538
2020-01-01 00:03:00 -0.974375 -0.005413  0.994005
2020-01-01 00:06:00  1.836822  0.010205  0.970560
2020-01-01 00:09:00  2.030140  0.011279  1.017799
2020-01-01 00:12:00  7.339233  0.040774  1.068230
2020-01-01 00:15:00 -8.736690 -0.087367  0.969861
                            A         B
2020-01-01 00:00:00 -6.214172  5.676805
2020-01-01 00:03:00 -0.974375  5.332746
2020-01-01 00:06:00  1.836822  5.207914
2020-01-01 00:09:00  2.030140  5.258446
2020-01-01 00:12:00  7.339233  5.680593
2020-01-01 00:15:00 -8.736690  5.490354
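
The same aggregation patterns on a tiny hand-checkable frame. This sketch passes function names as strings ('sum', 'mean') instead of NumPy functions, which I believe avoids deprecation warnings in recent pandas versions:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=6, freq='min')
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [10, 30, 20, 5, 5, 8]}, index=idx)
r = df.resample('3min')

# Several functions applied to one column.
multi = r['A'].agg(['sum', 'mean'])

# One function per column; the lambda computes the range (max - min) of each bin.
per_col = r.agg({'A': 'sum', 'B': lambda x: x.max() - x.min()})

print(multi)
print(per_col)
```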

Iterating over resampled groups works just like iterating over groupby groups, and each group can be processed separately:

small = pd.Series(range(6),index=pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 00:30:00'
                                                 , '2020-01-01 00:31:00','2020-01-01 01:00:00'
                                                 ,'2020-01-01 03:00:00','2020-01-01 03:05:00']))
resampled = small.resample('H')
for name, group in resampled:
    print("Group: ", name)
    print("-" * 27)
    print(group, end="\n\n")

Group:  2020-01-01 00:00:00
---------------------------
2020-01-01 00:00:00    0
2020-01-01 00:30:00    1
2020-01-01 00:31:00    2
dtype: int64

Group:  2020-01-01 01:00:00
---------------------------
2020-01-01 01:00:00    3
dtype: int64

Group:  2020-01-01 02:00:00
---------------------------
Series([], dtype: int64)

Group:  2020-01-01 03:00:00
---------------------------
2020-01-01 03:00:00    4
2020-01-01 03:05:00    5
dtype: int64
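
Iteration also makes it easy to collect a per-group summary. A sketch that counts the rows in each hourly bin, using '60min' as the bin width (an assumption made to sidestep the changing spelling of the hourly alias):

```python
import pandas as pd

small = pd.Series(range(6), index=pd.to_datetime([
    '2020-01-01 00:00:00', '2020-01-01 00:30:00', '2020-01-01 00:31:00',
    '2020-01-01 01:00:00', '2020-01-01 03:00:00', '2020-01-01 03:05:00']))

# Map each bin label to the number of rows that fall into it.
sizes = {name: len(group) for name, group in small.resample('60min')}
print(sizes)
```

The empty 02:00 bin is still visited, with a group of length 0.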

Window functions: rolling/expanding

s = pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2020', periods=1000))
print(s)

2020-01-01    0.404380
2020-01-02   -0.211402
2020-01-03   -1.398175
2020-01-04    1.018577
2020-01-05    0.894150
                ...   
2022-09-22    0.132534
2022-09-23    0.606834
2022-09-24   -0.598215
2022-09-25   -0.127116
2022-09-26   -1.714029
Freq: D, Length: 1000, dtype: float64

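The rolling method defines a window (the min_periods parameter is the threshold number of non-missing data points required). Like a groupby object, it performs no computation on its own and must be combined with an aggregation function to produce a result. count/sum/mean/median/min/max/std/var/skew/kurt/quantile/cov/corr are commonly used aggregation functions. When aggregating with apply, just remember that the function receives a window-sized Series and must return a scalar.

Time-based rolling accepts closed='right' (the default), 'left', 'both' or 'neither', which decides whether the window endpoints are included.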

print(s.rolling(window=50))
print(s.rolling(window=50).mean())
print(s.rolling(window=50,min_periods=3).mean().head())
print(s.rolling(window=50,min_periods=3).apply(lambda x:x.std()/x.mean()).head())  # coefficient of variation
print(s.rolling('15D').mean().head())
print(s.rolling('15D', closed='right').sum().head())

Rolling [window=50,center=False,axis=0]
2020-01-01         NaN
2020-01-02         NaN
2020-01-03         NaN
2020-01-04         NaN
2020-01-05         NaN
                ...   
2022-09-22   -0.059734
2022-09-23   -0.059340
2022-09-24   -0.086238
2022-09-25   -0.062391
2022-09-26   -0.068321
Freq: D, Length: 1000, dtype: float64
2020-01-01         NaN
2020-01-02         NaN
2020-01-03   -0.401732
2020-01-04   -0.046655
2020-01-05    0.141506
Freq: D, dtype: float64
2020-01-01          NaN
2020-01-02          NaN
2020-01-03    -2.280690
2020-01-04   -22.108891
2020-01-05     6.977926
Freq: D, dtype: float64
2020-01-01    0.404380
2020-01-02    0.096489
2020-01-03   -0.401732
2020-01-04   -0.046655
2020-01-05    0.141506
Freq: D, dtype: float64
2020-01-01    0.404380
2020-01-02    0.192979
2020-01-03   -1.205196
2020-01-04   -0.186619
2020-01-05    0.707531
Freq: D, dtype: float64
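
Since the series above is random, the effect of min_periods and closed is easier to see on five known values. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range('2020-01-01', periods=5, freq='D'))

fixed = s.rolling(window=3).mean()                   # NaN until 3 points are available
relaxed = s.rolling(window=3, min_periods=1).mean()  # starts as soon as 1 point exists
timed = s.rolling('2D').sum()                        # window (t - 2 days, t]
left = s.rolling('2D', closed='left').sum()          # window [t - 2 days, t)

print(pd.DataFrame({'fixed': fixed, 'relaxed': relaxed, 'timed': timed, 'left': left}))
```

With closed='left' the current row is excluded, so the first row has an empty window and the result there is NaN.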

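A plain expanding call is equivalent to rolling(window=len(s),min_periods=1) and computes cumulative statistics over the sequence. The apply method is available here as well, and cumsum/cumprod/cummax/cummin are special cases of expanding cumulative computation.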

print(s.rolling(window=len(s),min_periods=1).sum().head())
print(s.expanding().sum().head())
print(s.expanding().apply(lambda x:sum(x)).head())
print(s.cumsum().head())

2020-01-01    0.404380
2020-01-02    0.192979
2020-01-03   -1.205196
2020-01-04   -0.186619
2020-01-05    0.707531
Freq: D, dtype: float64
2020-01-01    0.404380
2020-01-02    0.192979
2020-01-03   -1.205196
2020-01-04   -0.186619
2020-01-05    0.707531
Freq: D, dtype: float64
2020-01-01    0.404380
2020-01-02    0.192979
2020-01-03   -1.205196
2020-01-04   -0.186619
2020-01-05    0.707531
Freq: D, dtype: float64
2020-01-01    0.404380
2020-01-02    0.192979
2020-01-03   -1.205196
2020-01-04   -0.186619
2020-01-05    0.707531
Freq: D, dtype: float64
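
The equivalence can also be checked programmatically on a short series of made-up values. The expanding mean is included as an example of a cumulative statistic that has no cum* shortcut:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# expanding aggregations agree with the dedicated cumulative methods.
same_sum = np.allclose(s.expanding().sum(), s.cumsum())
same_max = np.allclose(s.expanding().max(), s.cummax())

# Running mean of everything seen so far.
running_mean = s.expanding().mean()

print(same_sum, same_max)
print(running_mean)
```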

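shift/diff/pct_change all involve relationships between elements.

① shift keeps the index unchanged but moves the values backward (to later positions)

② diff is the difference between an element and a preceding one; the periods parameter gives the gap, defaults to 1, and may be negative

③ pct_change is the percentage change between an element and a preceding one; its periods parameter works as in diff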
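
The three methods on four made-up values, with the expected results noted in the comments:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 15.0, 30.0],
              index=pd.date_range('2020-01-01', periods=4, freq='D'))

moved = s.shift(1)        # [NaN, 10, 20, 15]: same index, values one step later
step = s.diff()           # [NaN, 10, -5, 15]: s - s.shift(1)
ahead = s.diff(-1)        # [-10, 5, -15, NaN]: s - s.shift(-1)
change = s.pct_change()   # [NaN, 1.0, -0.25, 1.0]: s / s.shift(1) - 1

print(pd.DataFrame({'s': s, 'shift': moved, 'diff': step,
                    'diff(-1)': ahead, 'pct_change': change}))
```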


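Exercises

[Exercise 1] Given a time series dataset of milk sales at a supermarket, time_series_one.csv, answer the following questions. (The code uses the CSV's own column names: 日期 is the date and 銷售額 is the sales amount.)

(a) On which day of the week did sales reach their maximum? (Hint: use the dayofweek attribute)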

df = pd.read_csv('data/time_series_one.csv', parse_dates=['日期'])
df['日期'].dt.dayofweek[df['銷售額'].idxmax()]

(b) Compute the monthly sales totals excluding the Spring Festival, National Day and May Day holidays

holiday = pd.date_range(start='20170501', end='20170503').append(
          pd.date_range(start='20171001', end='20171007')).append(
          pd.date_range(start='20180215', end='20180221')).append(
          pd.date_range(start='20180501', end='20180503')).append(
          pd.date_range(start='20181001', end='20181007')).append(
          pd.date_range(start='20190204', end='20190224')).append(
          pd.date_range(start='20190501', end='20190503')).append(
          pd.date_range(start='20191001', end='20191007'))
result = df[~df['日期'].isin(holiday)].set_index('日期').resample('MS').sum()
result

(c) Compute the total weekend (Saturday and Sunday) sales by quarter

result = df[df['日期'].dt.dayofweek.isin([5,6])].set_index('日期').resample('QS').sum()
result

(d) Starting from the last day and skipping Saturdays and Sundays, compute sales totals backward in units of 5 days

df_temp = df[~df['日期'].dt.dayofweek.isin([5,6])].set_index('日期').iloc[::-1]
L_temp,date_temp = [],[0]*df_temp.shape[0]
for i in range(df_temp.shape[0]//5):
    L_temp.extend([i]*5)
L_temp.extend([df_temp.shape[0]//5]*(df_temp.shape[0]-df_temp.shape[0]//5*5))
date_temp = pd.Series([i%5==0 for i in range(df_temp.shape[0])])
df_temp['num'] = L_temp
result = pd.DataFrame({'5天總額':df_temp.groupby('num')['銷售額'].sum().values},
                       index=df_temp.reset_index()[date_temp]['日期']).iloc[::-1]
result
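
The block numbering in (d) can also be written with integer division. This is an alternative sketch on synthetic data (the column names date and sales and the values are made up), not the original solution:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 14 consecutive days, sales 1..14.
dates = pd.date_range('2020-01-01', periods=14, freq='D')
df_demo = pd.DataFrame({'date': dates, 'sales': np.arange(1, 15)})

# Drop weekends, then walk backward from the last day.
weekdays = df_demo[~df_demo['date'].dt.dayofweek.isin([5, 6])].iloc[::-1].reset_index(drop=True)

# Rows 0-4 form block 0, rows 5-9 form block 1, and so on.
block = np.arange(len(weekdays)) // 5
totals = weekdays.groupby(block).agg(first_date=('date', 'first'), total=('sales', 'sum'))
print(totals)
```

Each block is labeled by its first date in the backward walk, that is, the latest date it contains.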

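(e) Suppose the data is found to be wrong: within every week, the Monday and Friday sales records were swapped. Compute the sales on the first Monday of each month in 2018 (if a week has no Monday or no Friday record, leave it unchanged)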

from datetime import datetime 
df_temp = df.copy()
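# dayofweek counts Monday as 0 and Friday as 4; this assumes one row per consecutive day
df_fri = df.shift(4)[df.shift(4)['日期'].dt.dayofweek==0]['銷售額']    # Monday's value, aligned to the Friday row
df_mon = df.shift(-4)[df.shift(-4)['日期'].dt.dayofweek==4]['銷售額']  # Friday's value, aligned to the Monday row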
df_temp.loc[df_fri.index,['銷售額']] = df_fri
df_temp.loc[df_mon.index,['銷售額']] = df_mon
df_temp.loc[df_temp[df_temp['日期'].dt.year==2018]['日期'][
        df_temp[df_temp['日期'].dt.year==2018]['日期'].apply(
        lambda x:True if datetime.strptime(str(x).split()[0],'%Y-%m-%d').weekday() == 0 
        and 1 <= datetime.strptime(str(x).split()[0],'%Y-%m-%d').day <= 7 else False)].index,:]
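
The "first Monday of each month" filter can also be expressed with dt attributes instead of strptime. A sketch on a synthetic calendar of 2018, offered as an alternative rather than as the original solution:

```python
import pandas as pd

days = pd.Series(pd.date_range('2018-01-01', '2018-12-31', freq='D'))

# A Monday whose day-of-month is 1..7 is necessarily the first Monday of its month.
first_mondays = days[(days.dt.dayofweek == 0) & (days.dt.day <= 7)]
print(first_mondays.dt.strftime('%Y-%m-%d').tolist())
```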

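[Exercise 2] Using the same data as the previous exercise, answer the following questions:

(a) Compute the rolling mean and rolling maximum with a 50-day window (set min_periods to 1)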

df = pd.read_csv('data/time_series_one.csv',index_col='日期',parse_dates=['日期'])
df['銷售額'].rolling(window=50,min_periods=1).mean().head()
df['銷售額'].rolling(window=50,min_periods=1).max().head()

(b) Apply the following rule: if a day's sales exceed the mean of the preceding 5 days, record 1, otherwise 0. Give the results for 2018

def f(x):
    if len(x) == 6:
        return 1 if x[-1]>np.mean(x[:-1]) else 0
    else:
        return 0
result_b = df.loc[pd.date_range(start='20171227',end='20181231'),:].rolling(
                                                    window=6,min_periods=1).agg(f)[5:].head()
result_b.head()
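
The same rule can be sketched without a custom function by shifting first, so that each window holds only the preceding 5 days. The data here is synthetic and the variable names are my own:

```python
import pandas as pd

sales = pd.Series([1, 2, 3, 4, 5, 10, 0],
                  index=pd.date_range('2020-01-01', periods=7, freq='D'))

# shift(1) excludes the current day; rolling(5) then averages the 5 days before it.
prev5_mean = sales.shift(1).rolling(window=5).mean()

# Comparisons against NaN are False, so days without 5 predecessors get 0.
flag = (sales > prev5_mean).astype(int)
print(flag)
```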

def f(x):
    if len(x) == 8:
        return 1 if x[-1]>np.mean(x[:-1][pd.Series([
            False if i in [5,6] else True for i in x[:-1].index.dayofweek],index=x[:-1].index)]) else 0
    else:
        return 0
result_c = df.loc[pd.date_range(start='20171225',end='20181231'),:].rolling(
                                    window=8,min_periods=1).agg(f)[7:].head()
result_c.head()

Coincidentally, the result is the same as in (b).

Reference

pandas official website
Joyful-Pandas
