import pandas as pd
df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
df
Age | Country | Income | |
---|---|---|---|
5000 | China | 10000 | |
1 | 4321 | China | 10000 |
2 | 1234 | India | 5000 |
3 | 4010 | India | 5002 |
4 | 250 | America | 40000 |
5 | 250 | Japan | 50000 |
6 | 4500 | China | 8000 |
7 | 4321 | India | 5000 |
主要就是了解下面這句話!!!後面的是一些細節!!
一個基礎的文法!!!
a=df.groupby('Country', as_index = False)['Income'].agg({ 'Income_sum':'sum'})
a
Country | Income_sum | |
---|---|---|
America | 40000 | |
1 | China | 28000 |
2 | India | 15002 |
3 | Japan | 50000 |
上面代碼的意思:統計每一種Conutry下Income的總和!!!
具體起到的作用:
①groupby(‘Country’, as_index = False) :按每一種Country進行統計!!!!!
②[‘Income’]:統計的是每一種Country下Income的相關特征!!!
③agg函數中 ‘Income_sum’:‘sum’,第一個字段’Income_sum’,其實你也可以寫’sum’,但是如果寫’Income_sum’,pandas會将
“每一個country下的Income總和”這個特征的特證名添為’Income_sum’
可以看到的是,這樣造出來的 Income_sum特征,需要拼接回原來的df上去
也就是
data_merged = pd.merge(df, a, on=['Country'], how='left')
data_merged
Age | Country | Income | Income_sum | |
---|---|---|---|---|
5000 | China | 10000 | 28000 | |
1 | 4321 | China | 10000 | 28000 |
2 | 1234 | India | 5000 | 15002 |
3 | 4010 | India | 5002 | 15002 |
4 | 250 | America | 40000 | 40000 |
5 | 250 | Japan | 50000 | 50000 |
6 | 4500 | China | 8000 | 28000 |
7 | 4321 | India | 5000 | 15002 |
比賽中經常用到的寫法
0 28000
1 28000
2 15002
3 15002
4 40000
5 50000
6 28000
7 15002
Name: Income, dtype: int64
可以看到,上面的資料,如第0行,和China的Income總和是一樣的,而
df的第0行,實際上也是China!!,也就是說,他已經按Country字段給merge好了!!!
-------------------------------------------------分割線,下面是原理和細節
Age | Income | |
---|---|---|
Country | ||
America | 250 | 40000 |
China | 5000 | 10000 |
India | 4321 | 5002 |
Japan | 250 | 50000 |
Age | Income | |||||
---|---|---|---|---|---|---|
min | mean | max | min | mean | max | |
Country | ||||||
America | 250 | 250.000000 | 250 | 40000 | 40000.000000 | 40000 |
China | 4321 | 4607.000000 | 5000 | 8000 | 9333.333333 | 10000 |
India | 1234 | 3188.333333 | 4321 | 5000 | 5000.666667 | 5002 |
Japan | 250 | 250.000000 | 250 | 50000 | 50000.000000 | 50000 |
下面這個就是單獨的取了一個字段的聚合結果
下面的結果表示,a[‘Age’]是單獨的一個字段的group,對單獨的一個字段的group我們可以對其聚合進行改名
比如下面 num_agg = {‘p’:[‘min’, ‘mean’, ‘max’]} 其實那個’p’按正常來說應該寫成Age,但是此處這麼寫,就相當于将表中的Age字段直接更名為’p’了
num_agg = {'p':['min', 'mean', 'max']}
a['Age'].agg(num_agg)############ 隻拿出了Age這一列!!!!!,也就是一個字段的時候可以改名
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
p | |||
---|---|---|---|
min | mean | max | |
Country | |||
America | 250 | 250.000000 | 250 |
China | 4321 | 4607.000000 | 5000 |
India | 1234 | 3188.333333 | 4321 |
Japan | 250 | 250.000000 | 250 |
要注意 a是一個包含兩個字段的(Income和Age)的group,
agg函數不支援對含兩個以上字段的聚合進行名字更改(如下圖)
num_agg = {'Age_1':['min', 'mean', 'max']}
a.agg(num_agg)###### a包含了兩個以上的字段了!
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-7-8131bee46d39> in <module>()
1 num_agg = {'Age_1':['min', 'mean', 'max']}
----> 2 a.agg(num_agg)###### a包含了兩個以上的字段了!
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1313 @Appender(_shared_docs['aggregate'])
1314 def aggregate(self, arg, *args, **kwargs):
-> 1315 return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
1316
1317 agg = aggregate
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
184
185 _level = kwargs.pop('_level', None)
--> 186 result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
187 if how is None:
188 return result
~/.local/lib/python3.5/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
408 k not in obj.columns):
409 raise KeyError(
--> 410 "Column '{col}' does not exist!".format(col=k))
411
412 arg = new_arg
KeyError: "Column 'Age_1' does not exist!"
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=False)
Income | Age | ||||
---|---|---|---|---|---|
min | max | min | mean | max | |
Country | |||||
America | 40000 | 40000 | 250 | 250.000000 | 250 |
China | 8000 | 10000 | 4321 | 4607.000000 | 5000 |
India | 5000 | 5002 | 1234 | 3188.333333 | 4321 |
Japan | 50000 | 50000 | 250 | 250.000000 | 250 |
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=True)
Income | Age | ||||
---|---|---|---|---|---|
min | max | min | mean | max | |
Country | |||||
America | 40000 | 40000 | 250 | 250.000000 | 250 |
China | 8000 | 10000 | 4321 | 4607.000000 | 5000 |
India | 5000 | 5002 | 1234 | 3188.333333 | 4321 |
Japan | 50000 | 50000 | 250 | 250.000000 | 250 |
num_agg = {'Age':['min', 'mean', 'max']}
a.agg(num_agg,as_index=False)
Age | |||
---|---|---|---|
min | mean | max | |
Country | |||
America | 250 | 250.000000 | 250 |
China | 4321 | 4607.000000 | 5000 |
India | 1234 | 3188.333333 | 4321 |
Japan | 250 | 250.000000 | 250 |
num_agg = {'Age':['min', 'mean', 'max']}
a['Age'].agg(num_agg,as_index=True)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
Age | |||
---|---|---|---|
min | mean | max | |
Country | |||
America | 250 | 250.000000 | 250 |
China | 4321 | 4607.000000 | 5000 |
India | 1234 | 3188.333333 | 4321 |
Japan | 250 | 250.000000 | 250 |
num_agg = {'min', 'mean', 'max'}#正規用法!!!!!!
a['Age'].agg(num_agg,as_index=False)
min | max | mean | |
---|---|---|---|
Country | |||
America | 250 | 250 | 250.000000 |
China | 4321 | 5000 | 4607.000000 |
India | 1234 | 4321 | 3188.333333 |
Japan | 250 | 250 | 250.000000 |