pandas 的官方文档:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
1. 重新索引
作用:创建一个适应新索引的新对象,会根据新索引对原数据进行重排,如果是新引入的索引,则会引入缺失值(也可用 fill_value 指定填充值)。
reindex 的函数参数:
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying |
method | Interpolation (fill) method, see table for options. |
fill_value | Substitute value to use when introducing missing data by reindexing |
limit | When forward- or backfilling, maximum size gap to fill |
level | Match simple Index on level of MultiIndex, otherwise select subset of |
copy | Do not copy underlying data if new index is equivalent to old index. True by default (i.e. always copy data). |
In [49]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [50]: obj
Out[50]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
In [51]: obj.reindex(['a','b','c','d','e']) # obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[51]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN # e 0
dtype: float64
对于有序的索引序列,在重新索引时,我们可以用 method 选项进行前后填充值:
In [56]: obj1 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [57]: obj1.reindex(range(6),method='ffill')
Out[59]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
reindex 的(插值)method选项:
ffill or pad | Fill (or carry) values forward |
bfill or backfill | Fill (or carry) values backward |
对于 Dataframe 可以单独重新指定 index 和 columns,也可以同时指定,默认是重新索引行。
Dataframe 中的插值只能应用在行上(即轴0)。
n [64]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
...: columns=['Ohio', 'Texas', 'California'])
In [66]: frame.reindex(['a','b','c','d'])
In [67]: states = ['Texas', 'Utah', 'California']
In [68]: frame.reindex(columns=states)
In [71]: frame.reindex(index=['a','b','c','d'],columns=states)
In [82]: frame.reindex(index=['a','b','c','d'],method='ffill').reindex(columns=states)
# python数据分析书中这句代码是:
# frme.reindex(index=['a', 'b', 'c', 'd'], method='ffill',columns=states)
# 由于版本的原因执行这句代码可能会出错
python数据分析书上利用 ix 的标签索引功能,这个在未来可能会废弃掉:
In[87]:frame.ix[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:Passing list-likes to
.loc or [] with any missing label will raiseKeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[87]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
In [88]: frame.loc[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raiseKeyError in the future,
you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[88]:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
2. 删除指定轴上的项
drop 方法返回的是一个删除指定轴上的项后的新对象。
In [96]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [97]: new_obj = obj.drop('c')
In [98]: obj.drop(['d', 'c'])
Out[98]:
a 0.0
b 1.0
e 4.0
dtype: float64
对于 Dataframe 可以删除任意轴上的索引值:
In [99]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
...: ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
...: ....: columns=['one', 'two', 'three', 'four'])
In [100]: data
Out[100]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [101]: data.drop(['Colorado', 'Ohio'])
Out[101]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [102]: data.drop('two', axis=1)
Out[102]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [103]: data.drop(['two', 'four'], axis=1)
Out[103]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
3.索引、选取、过滤
Series 的类似于numpy 数组的索引:
In [102]: obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [103]: obj['b']
In [104]: obj[1]
In [105]: obj[2:4]
In [106]: obj[['b', 'a', 'd']]
In [107]: obj[[1, 3]]
In [108]: obj[obj < 2]
利用标签进行索引和赋值(其末端包含):
In [110]: obj['b':'c'] = 5
对于 Dataframe 进行索引就是选取一个或多个列:
In [112]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
In [114]: data['two']
In [115]: data[['three', 'one']]
通过切片或布尔型数组选取行:
In [116]: data[:2]
In [117]: data[data['three'] > 5]
通过布尔型进行索引:
In [118]: data < 5
In [119]: data[data < 5] = 0
In [120]: data
用 ix 进行索引列和行(未来可能废除,改用其他方法,例:loc、iloc):
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
In [121]: data.ix['Colorado', ['two', 'three']]
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
In [123]: data.ix[2]
In [124]: data.ix[:'Utah', 'two']
In [125]: data.ix[data.three > 5, :3]
Dataframe 的索引选项:
obj.ix[val] | Selects single row of subset of rows from the DataFrame. |
obj.ix[:, val] | Selects single column of subset of columns. |
obj.ix[val1, val2] | Select both rows and columns. |
reindex | Conform one or more axes to new indexes. |
xs | Select single row or column as a Series by label. |
icol, irow | Select single column or row, respectively, as a Series by integer location. |
get_value, set_value | Select single value by row and column label. |
4. 算术运算
对象相加时,结果索引是每个对象的索引的并集,对于不重叠的索引,其值会填充 NA(可以指定填充值):
In [126]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1 + s2
Out[128]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
In [129]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
...: index=['Ohio', 'Texas', 'Colorado'])
In [130]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [131]: df1 + df2
Out[131]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
In [132]: df1.add(df2,fill_value=0)
Out[132]:
b c d e
Colorado 6.0 7.0 8.0 NaN
Ohio 3.0 1.0 6.0 5.0
Oregon 9.0 NaN 10.0 11.0
Texas 9.0 4.0 12.0 8.0
Utah 0.0 NaN 1.0 2.0
Method Description
add | Method for addition (+) |
sub | Method for subtraction (-) |
div | Method for division (/) |
mul | Method for multiplication (*) |
Dataframe 和 Series 之间的运算
这两者之间的运算涉及到了广播的知识,以后会有介绍广播相关的知识。一维二维的广播都比较容易理解。
In [153]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [154]: series = frame.iloc[0]
In [155]: frame - series # 嗯,大概就是这样,理解一下
Out[155]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
如果你希望匹配行且在列上广播,必须使用算数运算:
In [160]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [161]: series1 = frame['d']
In [162]: frame.sub(series1,axis=0)
Out[162]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
5. 函数应用于映射
numpy 的 ufuncs (元素级数组方法)也可用于操作pandas对象:
In [164]: frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [167]: frame
Out[167]:
b d e
Utah -0.637896 0.509292 -0.919939
Ohio -0.604495 0.298296 -0.377575
Texas -0.710751 -0.091902 0.607375
Oregon 0.576612 1.664728 0.264065
In [168]: np.abs(frame)
Out[168]:
b d e
Utah 0.637896 0.509292 0.919939
Ohio 0.604495 0.298296 0.377575
Texas 0.710751 0.091902 0.607375
Oregon 0.576612 1.664728 0.264065
也可用apply方法把函数应用到由各列或各行形成的一维数组上:
In [172]: f = lambda x: x.max() - x.min()
In [173]: frame.apply(f)
Out[173]:
b 1.287363
d 1.756631
e 1.527314
dtype: float64
也可返回多个值组成的Series:
In [176]: def f(x):
...: return pd.Series([x.min(), x.max()], index=['min', 'max'])
In [177]: frame.apply(f)
Out[177]:
b d e
min -0.710751 -0.091902 -0.919939
max 0.576612 1.664728 0.607375
元素级的python函数也是可用的,使用applymap方法:
In [179]: format = lambda x: '%.2f' % x
In [180]: frame.applymap(format)
Out[180]:
b d e
Utah -0.64 0.51 -0.92
Ohio -0.60 0.30 -0.38
Texas -0.71 -0.09 0.61
Oregon 0.58 1.66 0.26
6. 排序与排名
对行或列索引进行排序可以使用 sort_index 方法
对Series安值进行排序,可使用sort_values方法,若某个索引缺失值,则会被放到末尾
In [183]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [187]: obj.sort_index() # obj.sort_index(ascending=False) 降序
In [189]: obj.sort_values() # obj.sort_values(ascending=False) 降序
对于Dataframe 可以根据任意轴上的索引进行排序,默认是升序,也可降序排序:
In [196]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
...: columns=['d', 'a', 'b', 'c'])
In [197]: frame.sort_index()
Out[197]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [198]: frame.sort_index(axis=1)
Out[198]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [199]: frame.sort_index(axis=1, ascending=False)
Out[199]:
d c b a
three 0 3 2 1
one 4 7 6 5
在 Dataframe 上还可以使用 by 关键字,根据一或多列的值进行排序:
In [203]: frame.sort_values(by='b') # FutureWarning: by argument to sort_index
# is deprecated, please use .sort_values(by=...)
Out[203]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
In [204]: frame.sort_values(by=['a','b'])
Out[204]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
注意:对DataFrame的值进行排序的时候,我们必须要使用by指定某一行(列)或者某几行(列),
如果不使用by参数进行指定的时候,就会报TypeError: sort_values() missing 1 required positional argument: 'by'。
使用by参数进行某几列(行)排序的时候,以列表中的第一个为准,可能后面的不会生效,因为有的时候无法做到既对第一行(列)
进行升序排序又对第二行(列)进行排序。在指定行值进行排序的时候,必须设置axis=1,不然会报错,因为默认指定的是列索引,
找不到这个索引所以报错,axis=1的意思是指定行索引。
排名:
排名和排序有点类似,排名会有一个排名值(从1开始,一直到数组中有效数据的数量),它与numpy.argsort的间接排序索引差不多,只不过它可以根据某种规则破坏平级关系。
In [214]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4]) # 下标对应 (0, 1, 2, 3, 4, 5, 6)
In [215]: obj.rank() # 默认是根据值的大小进行平均排名
Out[215]:
0 6.5 # 7 最大 由于有两个 7 ,所以排名为 6,7 名,平均排名 6.5
1 1.0 # -5 最小 对应下标为 1 ,排在第一
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
根据值在数组中出现的顺序进行排名 :
In [216]: obj.rank(method='first') # 也可以按降序排名 obj.rank(ascending=False,method='max') 按照分组的最大排名排序
Out[216]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
也可以指定轴进行排名:
In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
...: 'c': [-2, 5, 8, -2.5]})
In [220]: frame
Out[220]:
a b c
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
In [221]: frame.rank(axis=1)
Out[221]:
a b c
0 2.0 3.0 1.0
1 1.0 3.0 2.0
2 2.0 1.0 3.0
3 2.0 3.0 1.0
排名时用于破坏平级关系的method选项:
average | Default: assign the average rank to each entry in the equal group |
min | Use the minimum rank for the whole group |
max | Use the maximum rank for the whole group |
first | Assign ranks in the order the values appear in the data |
7. 汇总和计算描述统计
可以指定对行或列进行统计。统计时默认会跳过 NA 值,也可以用 skipna 指定不跳过。
In [224]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
...: [np.nan, np.nan], [0.75, -1.3]],
...: index=['a', 'b', 'c', 'd'],
...: columns=['one', 'two'])
In [225]: df
Out[225]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [227]: df.sum()
Out[227]:
one 9.25
two -5.80
dtype: float64
In [228]: df.sum(axis=1)
Out[228]:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
In [229]: df.sum(axis=1,skipna=False)
Out[229]:
a NaN
b 2.60
c NaN
d -0.55
dtype: float64
约简方法的选项
Method Description
axis | Axis to reduce over. 0 for DataFrame’s rows and 1 for columns. |
skipna | Exclude missing values, True by default. |
level | Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex). |
描述和汇总统计
Method Description
count | Number of non-NA values |
describe | Compute set of summary statistics for Series or each DataFrame column |
min, max | Compute minimum and maximum values |
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
idxmin, idxmax | Compute index values at which minimum or maximum value obtained, respectively |
quantile | Compute sample quantile ranging from 0 to 1 |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median (50% quantile) of values |
mad | Mean absolute deviation from mean value |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (3rd moment) of values |
kurt | Sample kurtosis (4th moment) of values |
cumsum | Cumulative sum of values |
cummin, cummax | Cumulative minimum or maximum of values, respectively |
cumprod | Cumulative product of values |
diff | Compute 1st arithmetic difference (useful for time series) |
pct_change | Compute percent changes |
8.唯一值,值计数,成员资格
In [231]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [232]: uniques = obj.unique() # 唯一值
In [233]: uniques
Out[233]: array(['c', 'a', 'd', 'b'], dtype=object)
In [234]: obj.value_counts() # 计数值
Out[234]:
c 3
a 3
b 2
d 1
dtype: int64
In [235]: mask = obj.isin(['b', 'c']) # 成员关系
In [236]: mask
Out[236]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool
In [237]: obj[mask]
Out[237]:
0 c
5 b
6 b
7 c
8 c
dtype: object
Method Description
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values. |
unique | Compute array of unique values in a Series, returned in the order observed. |
value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered count in descending order. |
将pandas的 pandas.value_counts 传给 Dataframe 的 apply 函数:
In [239]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
...: 'Qu2': [2, 3, 1, 2, 3],
...: 'Qu3': [1, 5, 2, 4, 4]})
In [240]: data
Out[240]:
Qu1 Qu2 Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
In [241]: result = data.apply(pd.value_counts).fillna(0)
In [242]: result
Out[242]:
Qu1 Qu2 Qu3
1 1.0 1.0 1.0
2 0.0 2.0 1.0
3 2.0 2.0 0.0
4 2.0 0.0 2.0
5 0.0 0.0 1.0
9.缺失值的处理
pandas 中使用浮点值 NAN (Not a Number) 表示浮点和非浮点数组中的缺失数据,只是一种便于表示的标记。
NA处理的方法:
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
isnull | Return like-type object containing boolean values indicating which values are missing / NA. |
notnull | Negation of isnull. |
过滤缺失数据:dropna
对于 Series ,dropna()仅仅返回一个非空数据和索引值的 Series:
In [6]: from numpy import nan as NA
In [7]: data = pd.Series([1,NA,4,NA,5])
In [8]: data.dropna() # 也可以通过bool索引达到此目的:data[data.notnull()]
Out[8]:
0 1.0
2 4.0
4 5.0
dtype: float64
对于 Dataframe ,dropna 默认丢弃任何含有缺失值的行,传入参数 how='all' ,只丢弃全为NA的行。要丢弃为NA的行,传入参数 axis=1,即可。参数 thresh 可以保留部分数据。
填充缺失数据:fillna
In [9]: df = pd.DataFrame(np.random.randn(7,3))
In [10]: df
Out[10]:
0 1 2
0 -1.405991 -1.032070 -0.421601
1 0.878711 -0.786235 1.483527
2 -0.082090 -0.163028 -0.718293
3 -0.576532 0.229013 0.387237
4 -0.682892 0.547743 0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911 0.449039
In [12]: df.iloc[:4,1] = NA
In [13]: df.iloc[:2,2] = NA
In [14]: df
Out[14]:
0 1 2
0 -1.405991 NaN NaN
1 0.878711 NaN NaN
2 -0.082090 NaN -0.718293
3 -0.576532 NaN 0.387237
4 -0.682892 0.547743 0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911 0.449039
In [15]: df.fillna(0) # 用任意数去填充所有的缺失值
Out[15]:
0 1 2
0 -1.405991 0.000000 0.000000
1 0.878711 0.000000 0.000000
2 -0.082090 0.000000 -0.718293
3 -0.576532 0.000000 0.387237
4 -0.682892 0.547743 0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911 0.449039
In [20]: df.fillna({1:0.5,2:1.}) # 传入一个字典,可以实现对列中缺失值的填充
Out[20]:
0 1 2
0 -1.405991 0.500000 1.000000
1 0.878711 0.500000 1.000000
2 -0.082090 0.500000 -0.718293
3 -0.576532 0.500000 0.387237
4 -0.682892 0.547743 0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911 0.449039
fillna 默认总会产生一个新的对象,用inplace 参数实现就地修改。
用 method 参数填充数据:
In [27]: df.fillna(method='bfill')
Out[27]:
0 1 2
0 -1.405991 0.547743 -0.718293
1 0.878711 0.547743 -0.718293
2 -0.082090 0.547743 -0.718293
3 -0.576532 0.547743 0.387237
4 -0.682892 0.547743 0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911 0.449039
In [28]: df.fillna(method='bfill',limit=2)
Out[28]:
0 1 2
0 -1.405991 NaN -0.718293
1 0.878711 NaN -0.718293
2 -0.082090 0.547743 -0.718293
3 -0.576532 0.547743 0.387237
4 -0.682892 0.547743 0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911 0.449039
fillna 参数:
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation, by default 'ffill' if function called with no other arguments |
axis | Axis to fill on, default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
10. 层次化索引
能在一个轴上拥有多个索引级别,能以低维度形式处理高纬度数据。
创建一个层次化索引的 Series:
In [31]: data = pd.Series(np.random.randn(10),
...: index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
...: [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [32]: data
Out[32]:
a 1 -0.413827
2 0.660228
3 0.209686
b 1 -0.361603
2 -0.982985
3 -0.267620
c 1 -1.130506
2 -2.023760
d 2 0.989250
3 1.074886
dtype: float64
In [33]: data.index
Out[33]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
索引:
In [34]: data['a']
Out[34]:
1 -0.413827
2 0.660228
3 0.209686
dtype: float64
In [35]: data[:,2] # 实现内层索引
Out[35]:
a 0.660228
b -0.982985
c -2.023760
d 0.989250dtype: float64
In [36]: data['b':'d'] # 实现切片索引
Out[36]:
b 1 -0.361603
2 -0.982985
3 -0.267620
c
1 -1.130506
2 -2.023760
d
2 0.989250
3 1.074886
dtype: float64
In [37]: data.loc[['b','c']]
Out[37]:
b
1 -0.361603
2 -0.982985
3 -0.267620
c
1 -1.130506
2 -2.023760
dtype: float64
层次化索引可以通过 unstack 方法生成 Dataframe 数据:
In [38]: data.unstack()
Out[38]:
1 2 3
a -0.413827 0.660228 0.209686
b -0.361603 -0.982985 -0.267620
c -1.130506 -2.023760 NaN
d NaN 0.989250 1.074886
In [39]: data.unstack().unstack()
Out[39]:
1 a -0.413827
b -0.361603
c -1.130506
d NaN
2 a 0.660228
b -0.982985
c -2.023760
d 0.989250
3 a 0.209686
b -0.267620
c NaN
d 1.074886
dtype: float64
In [42]: data.unstack().stack() # stack 是unstack的逆运算
Out[42]:
a 1 -0.413827
2 0.660228
3 0.209686
b 1 -0.361603
2 -0.982985
3 -0.267620
c 1 -1.130506
2 -2.023760
d 2 0.989250
3 1.074886
dtype: float64
对于 Dataframe 每条轴都可以有层次化索引,每个索引还都可以有名字:
In [44]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
...: index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
...: columns=[['Ohio', 'Ohio', 'Colorado'],
...: ['Green', 'Red', 'Green']]
...: )
In [45]: frame
Out[45]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [46]: frame.index.names = ['key1', 'key2']
In [47]: frame.columns.names = ['state', 'color']
In [48]: frame
Out[48]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
swaplevel() : 调整某条轴上各级别的顺序;sort_index(): 对各级别上的数据值进行排序
In [49]: frame.swaplevel('key1','key2')
Out[49]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
In [51]: frame.swaplevel(0,1)
Out[51]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
In [53]: frame.swaplevel(0,1,axis=1)
Out[53]:
color Green Red Green
state Ohio Ohio Colorado
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [54]: frame.sortlevel(1)
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning: sortlevel is
deprecated, use sort_index(level= ...)
Out[54]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [55]: frame.sort_index(level=1)
Out[55]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [56]: frame.sort_index(level=1,axis=1)
Out[56]:
state Colorado Ohio
color Green Green Red
key1 key2
a 1 2 0 1
2 5 3 4
b 1 8 6 7
2 11 9 10
有的时候我们想要把 Dataframe 的某列或某几列当做 Dataframe 的索引:
In [59]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
...: 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
...: 'd': [0, 1, 2, 0, 1, 2, 3]})
In [60]: frame
Out[60]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
In [61]: frame1 = frame.set_index(['c','d'])
In [62]: frame1
Out[62]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
默认情况下,被当做索引的列会被移除,也可通过drop=False保存下来:
In [63]: frame.set_index(['c','d'],drop=False)
Out[63]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
reset_index的作用跟 set_index 正好相反:
In [64]: frame1.reset_index()
Out[64]:
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1