天天看點

Pandas進階教程之:時間處理簡介時間分類TimestampPeriodDateOffset作為index時間序列的操作Resampling 重新取樣

簡介

時間應該是在資料進行中經常會用到的一種資料類型,除了Numpy中datetime64 和 timedelta64 這兩種資料類型之外,pandas 還整合了其他python庫比如

scikits.timeseries

中的功能。

時間分類

pandas中有四種時間類型:

  1. Date times : 日期和時間,可以帶時區。和标準庫中的

    datetime.datetime

    類似。
  2. Time deltas: 絕對持續時間,和 标準庫中的

    datetime.timedelta

  3. Time spans: 由時間點及其關聯的頻率定義的時間跨度。
  4. Date offsets:基于月曆計算的時間 和 dateutil.relativedelta.relativedelta 類似。

我們用一張表來表示:

類型 标量class 數組class pandas資料類型 主要建立方法
Date times

Timestamp

DatetimeIndex

datetime64[ns]

or

datetime64[ns, tz]

to_datetime

date_range

Time deltas

Timedelta

TimedeltaIndex

timedelta64[ns]

to_timedelta

timedelta_range

Time spans

Period

PeriodIndex

period[freq]

Period

period_range

Date offsets

DateOffset

None

None

DateOffset

看一個使用的例子:

In [19]: pd.Series(range(3), index=pd.date_range("2000", freq="D", periods=3))
Out[19]: 
2000-01-01    0
2000-01-02    1
2000-01-03    2
Freq: D, dtype: int64      

看一下上面資料類型的空值:

In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT
In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT
In [26]: pd.Period(pd.NaT)
Out[26]: NaT
# Equality acts as np.nan would
In [27]: pd.NaT == pd.NaT
Out[27]: False      

Timestamp

Timestamp 是最基礎的時間類型,我們可以這樣建立:

In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')
In [29]: pd.Timestamp("2012-05-01")
Out[29]: Timestamp('2012-05-01 00:00:00')
In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')      

DatetimeIndex

Timestamp 作為index會自動被轉換為DatetimeIndex:

In [33]: dates = [
   ....:     pd.Timestamp("2012-05-01"),
   ....:     pd.Timestamp("2012-05-02"),
   ....:     pd.Timestamp("2012-05-03"),
   ....: ]
   ....: 
In [34]: ts = pd.Series(np.random.randn(3), dates)
In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex
In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In [37]: ts
Out[37]: 
2012-05-01    0.469112
2012-05-02   -0.282863
2012-05-03   -1.509059
dtype: float64      

date_range 和 bdate_range

還可以使用 date_range 來建立DatetimeIndex:

In [74]: start = datetime.datetime(2011, 1, 1)
In [75]: end = datetime.datetime(2012, 1, 1)
In [76]: index = pd.date_range(start, end)
In [77]: index
Out[77]: 
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
               '2011-01-09', '2011-01-10',
               ...
               '2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
               '2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
               '2011-12-31', '2012-01-01'],
              dtype='datetime64[ns]', length=366, freq='D')      

date_range

是月曆範圍,

bdate_range

是工作日範圍:

In [78]: index = pd.bdate_range(start, end)
In [79]: index
Out[79]: 
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
               '2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
               '2011-01-13', '2011-01-14',
               ...
               '2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
               '2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
               '2011-12-29', '2011-12-30'],
              dtype='datetime64[ns]', length=260, freq='B')      

兩個方法都可以帶上

start

,

end

, 和

periods

參數。

In [84]: pd.bdate_range(end=end, periods=20)
In [83]: pd.date_range(start, end, freq="W")
In [86]: pd.date_range("2018-01-01", "2018-01-05", periods=5)      

origin

使用

origin

參數,可以修改

DatetimeIndex

的起點:

In [67]: pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))
Out[67]: DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'], dtype='datetime64[ns]', freq=None)      

預設情況下

origin='unix'

, 也就是起點是

1970-01-01 00:00:00

.

In [68]: pd.to_datetime([1, 2, 3], unit="D")
Out[68]: DatetimeIndex(['1970-01-02', '1970-01-03', '1970-01-04'], dtype='datetime64[ns]', freq=None)      

格式化

使用format參數可以對時間進行格式化:

In [51]: pd.to_datetime("2010/11/12", format="%Y/%m/%d")
Out[51]: Timestamp('2010-11-12 00:00:00')
In [52]: pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")
Out[52]: Timestamp('2010-11-12 00:00:00')      

Period

Period 表示的是一個時間跨度,通常和freq一起使用:

In [31]: pd.Period("2011-01")
Out[31]: Period('2011-01', 'M')
In [32]: pd.Period("2012-05", freq="D")
Out[32]: Period('2012-05-01', 'D')      

Period可以直接進行運算:

In [345]: p = pd.Period("2012", freq="A-DEC")
In [346]: p + 1
Out[346]: Period('2013', 'A-DEC')
In [347]: p - 3
Out[347]: Period('2009', 'A-DEC')
In [348]: p = pd.Period("2012-01", freq="2M")
In [349]: p + 2
Out[349]: Period('2012-05', '2M')
In [350]: p - 1
Out[350]: Period('2011-11', '2M')      
注意,Period隻有具有相同的freq才能進行算數運算。包括 offsets 和 timedelta
In [352]: p = pd.Period("2014-07-01 09:00", freq="H")
In [353]: p + pd.offsets.Hour(2)
Out[353]: Period('2014-07-01 11:00', 'H')
In [354]: p + datetime.timedelta(minutes=120)
Out[354]: Period('2014-07-01 11:00', 'H')
In [355]: p + np.timedelta64(7200, "s")
Out[355]: Period('2014-07-01 11:00', 'H')      

Period作為index可以自動被轉換為PeriodIndex:

In [38]: periods = [pd.Period("2012-01"), pd.Period("2012-02"), pd.Period("2012-03")]
In [39]: ts = pd.Series(np.random.randn(3), periods)
In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex
In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')
In [42]: ts
Out[42]: 
2012-01   -1.135632
2012-02    1.212112
2012-03   -0.173215
Freq: M, dtype: float64      

可以通過 pd.period_range 方法來建立 PeriodIndex:

In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")
In [360]: prng
Out[360]: 
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
             '2011-07', '2011-08', '2011-09', '2011-10', '2011-11', '2011-12',
             '2012-01'],
            dtype='period[M]', freq='M')      

還可以通過PeriodIndex直接建立:

In [361]: pd.PeriodIndex(["2011-1", "2011-2", "2011-3"], freq="M")
Out[361]: PeriodIndex(['2011-01', '2011-02', '2011-03'], dtype='period[M]', freq='M')      

DateOffset

DateOffset表示的是頻率對象。它和Timedelta很類似,表示的是一個持續時間,但是有特殊的月曆規則。比如Timedelta一天肯定是24小時,而在 DateOffset中根據夏令時的不同,一天可能會有23,24或者25小時。

# This particular day contains a day light savings time transition
In [144]: ts = pd.Timestamp("2016-10-30 00:00:00", tz="Europe/Helsinki")
# Respects absolute time
In [145]: ts + pd.Timedelta(days=1)
Out[145]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')
# Respects calendar time
In [146]: ts + pd.DateOffset(days=1)
Out[146]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')
In [147]: friday = pd.Timestamp("2018-01-05")
In [148]: friday.day_name()
Out[148]: 'Friday'
# Add 2 business days (Friday --> Tuesday)
In [149]: two_business_days = 2 * pd.offsets.BDay()
In [150]: two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')
In [151]: friday + two_business_days
Out[151]: Timestamp('2018-01-09 00:00:00')
In [152]: (friday + two_business_days).day_name()
Out[152]: 'Tuesday'      

DateOffsets 和Frequency 運算是先關的,看一下可用的Date Offset 和它相關聯的 Frequency:

Date Offset Frequency String 描述

DateOffset

None 通用的offset 類

BDay

BusinessDay

'B'

工作日

CDay

CustomBusinessDay

'C'

自定義的工作日

Week

'W'

一周

WeekOfMonth

'WOM'

每個月的第幾周的第幾天

LastWeekOfMonth

'LWOM'

每個月最後一周的第幾天

MonthEnd

'M'

月曆月末
MonthBegin

'MS'

月曆月初

BMonthEnd

BusinessMonthEnd

'BM'

營業月底

BMonthBegin

BusinessMonthBegin

'BMS'

營業月初

CBMonthEnd

CustomBusinessMonthEnd

'CBM'

自定義營業月底

CBMonthBegin

CustomBusinessMonthBegin

'CBMS'

自定義營業月初

SemiMonthEnd

'SM'

月曆月末的第15天

SemiMonthBegin

'SMS'

月曆月初的第15天

QuarterEnd

'Q'

月曆季末

QuarterBegin

'QS'

月曆季初

BQuarterEnd

'BQ

工作季末

BQuarterBegin

'BQS'

工作季初

FY5253Quarter

'REQ'

零售季( 52-53 week)

YearEnd

'A'

月曆年末

YearBegin

'AS'

'BYS'

月曆年初

BYearEnd

'BA'

營業年末

BYearBegin

'BAS'

營業年初

FY5253

'RE'

零售年 (aka 52-53 week)

Easter

複活節假期

BusinessHour

'BH'

business hour

CustomBusinessHour

'CBH'

custom business hour

Day

'D'

一天的絕對時間

Hour

'H'

一小時

Minute

'T'

'min'

一分鐘

Second

'S'

一秒鐘

Milli

'L'

'ms'

一微妙

Micro

'U'

'us'

一毫秒

Nano

'N'

一納秒

DateOffset還有兩個方法

rollforward()

rollback()

可以将時間進行移動:

In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")
In [154]: ts.day_name()
Out[154]: 'Saturday'
# BusinessHour's valid offset dates are Monday through Friday
In [155]: offset = pd.offsets.BusinessHour(start="09:00")
# Bring the date to the closest offset date (Monday)
In [156]: offset.rollforward(ts)
Out[156]: Timestamp('2018-01-08 09:00:00')
# Date is brought to the closest offset date first and then the hour is added
In [157]: ts + offset
Out[157]: Timestamp('2018-01-08 10:00:00')      

上面的操作會自動儲存小時,分鐘等資訊,如果想要設定為 00:00:00 , 可以調用normalize() 方法:

In [158]: ts = pd.Timestamp("2014-01-01 09:00")
In [159]: day = pd.offsets.Day()
In [160]: day.apply(ts)
Out[160]: Timestamp('2014-01-02 09:00:00')
In [161]: day.apply(ts).normalize()
Out[161]: Timestamp('2014-01-02 00:00:00')
In [162]: ts = pd.Timestamp("2014-01-01 22:00")
In [163]: hour = pd.offsets.Hour()
In [164]: hour.apply(ts)
Out[164]: Timestamp('2014-01-01 23:00:00')
In [165]: hour.apply(ts).normalize()
Out[165]: Timestamp('2014-01-01 00:00:00')
In [166]: hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
Out[166]: Timestamp('2014-01-02 00:00:00')      

作為index

時間可以作為index,并且作為index的時候會有一些很友善的特性。

可以直接使用時間來擷取相應的資料:

In [99]: ts["1/31/2011"]
Out[99]: 0.11920871129693428
In [100]: ts[datetime.datetime(2011, 12, 25):]
Out[100]: 
2011-12-30    0.56702
Freq: BM, dtype: float64
In [101]: ts["10/31/2011":"12/31/2011"]
Out[101]: 
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64      

擷取全年的資料:

In [102]: ts["2011"]
Out[102]: 
2011-01-31    0.119209
2011-02-28   -1.044236
2011-03-31   -0.861849
2011-04-29   -2.104569
2011-05-31   -0.494929
2011-06-30    1.071804
2011-07-29    0.721555
2011-08-31   -0.706771
2011-09-30   -1.039575
2011-10-31    0.271860
2011-11-30   -0.424972
2011-12-30    0.567020
Freq: BM, dtype: float64      

擷取某個月的資料:

In [103]: ts["2011-6"]
Out[103]: 
2011-06-30    1.071804
Freq: BM, dtype: float64      

DF可以接受時間作為loc的參數:

In [105]: dft
Out[105]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
[100000 rows x 1 columns]
In [106]: dft.loc["2013"]
Out[106]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
[100000 rows x 1 columns]      

時間切片:

In [107]: dft["2013-1":"2013-2"]
Out[107]: 
                            A
2013-01-01 00:00:00  0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00  0.113648
2013-01-01 00:04:00 -1.478427
...                       ...
2013-02-28 23:55:00  0.850929
2013-02-28 23:56:00  0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
[84960 rows x 1 columns]      

切片和完全比對

考慮下面的一個精度為分的Series對象:

In [120]: series_minute = pd.Series(
   .....:     [1, 2, 3],
   .....:     pd.DatetimeIndex(
   .....:         ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
   .....:     ),
   .....: )
   .....: 
In [121]: series_minute.index.resolution
Out[121]: 'minute'      

時間精度小于分的話,傳回的是一個Series對象:

In [122]: series_minute["2011-12-31 23"]
Out[122]: 
2011-12-31 23:59:00    1
dtype: int64      

時間精度大于分的話,傳回的是一個常量:

In [123]: series_minute["2011-12-31 23:59"]
Out[123]: 1
In [124]: series_minute["2011-12-31 23:59:00"]
Out[124]: 1      

同樣的,如果精度為秒的話,小于秒會傳回一個對象,等于秒會傳回常量值。

時間序列的操作

Shifting

使用shift方法可以讓 time series 進行相應的移動:

In [275]: ts = pd.Series(range(len(rng)), index=rng)
In [276]: ts = ts[:5]
In [277]: ts.shift(1)
Out[277]: 
2012-01-01    NaN
2012-01-02    0.0
2012-01-03    1.0
Freq: D, dtype: float64      

通過指定 freq , 可以設定shift的方式:

In [278]: ts.shift(5, freq="D")
Out[278]: 
2012-01-06    0
2012-01-07    1
2012-01-08    2
Freq: D, dtype: int64
In [279]: ts.shift(5, freq=pd.offsets.BDay())
Out[279]: 
2012-01-06    0
2012-01-09    1
2012-01-10    2
dtype: int64
In [280]: ts.shift(5, freq="BM")
Out[280]: 
2012-05-31    0
2012-05-31    1
2012-05-31    2
dtype: int64      

頻率轉換

時間序列可以通過調用 asfreq 的方法轉換其頻率:

In [281]: dr = pd.date_range("1/1/2010", periods=3, freq=3 * pd.offsets.BDay())
In [282]: ts = pd.Series(np.random.randn(3), index=dr)
In [283]: ts
Out[283]: 
2010-01-01    1.494522
2010-01-06   -0.778425
2010-01-11   -0.253355
Freq: 3B, dtype: float64
In [284]: ts.asfreq(pd.offsets.BDay())
Out[284]: 
2010-01-01    1.494522
2010-01-04         NaN
2010-01-05         NaN
2010-01-06   -0.778425
2010-01-07         NaN
2010-01-08         NaN
2010-01-11   -0.253355
Freq: B, dtype: float64      

asfreq還可以指定修改頻率過後的填充方法:

In [285]: ts.asfreq(pd.offsets.BDay(), method="pad")
Out[285]: 
2010-01-01    1.494522
2010-01-04    1.494522
2010-01-05    1.494522
2010-01-06   -0.778425
2010-01-07   -0.778425
2010-01-08   -0.778425
2010-01-11   -0.253355
Freq: B, dtype: float64      

Resampling 重新取樣

給定的時間序列可以通過調用resample方法來重新取樣:

In [286]: rng = pd.date_range("1/1/2012", periods=100, freq="S")
In [287]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [288]: ts.resample("5Min").sum()
Out[288]: 
2012-01-01    25103
Freq: 5T, dtype: int64      

resample 可以接受各類統計方法,比如:

sum

mean

std

sem

max

min

median

first

last

ohlc

In [289]: ts.resample("5Min").mean()
Out[289]: 
2012-01-01    251.03
Freq: 5T, dtype: float64
In [290]: ts.resample("5Min").ohlc()
Out[290]: 
            open  high  low  close
2012-01-01   308   460    9    205
In [291]: ts.resample("5Min").max()
Out[291]: 
2012-01-01    460
Freq: 5T, dtype: int64      
本文已收錄于 http://www.flydean.com/15-python-pandas-time/

最通俗的解讀,最深刻的幹貨,最簡潔的教程,衆多你不知道的小技巧等你來發現!

歡迎關注我的公衆号:「程式那些事」,懂技術,更懂你!