Python: grouping and aggregating based on multiple time series
I am new to Python and pandas and have some basic questions about how to write a short function that takes a pd.DataFrame and returns relative values grouped by month.
Example data:
import pandas as pd
from datetime import datetime
import numpy as np
date_rng = pd.date_range(start='2019-01-01', end='2019-03-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value_in_question'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index('date',inplace=True)
df.head()
value_in_question
date
2019-01-01 40
2019-01-02 86
2019-01-03 46
2019-01-04 75
2019-01-05 35
def absolute_to_relative(df):
    """
    set_index before using
    """
    return df.div(df.sum(), axis=1).mul(100)
relative_df = absolute_to_relative(df)
relative_df.head()
value_in_question
date
2019-01-01 0.895055
2019-01-02 1.924368
2019-01-03 1.029313
2019-01-04 1.678228
2019-01-05 0.783173
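Instead of taking the column sum and dividing each row by that value, I would like to use the sum of each month.
The final df should have the same shape and form, but with the row values relative to the monthly sum.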
Old:
New:
So I tried the following, which returns NaN for value_in_question:
def absolute_to_relative_agg(df, agg):
    """
    set_index before using
    """
    return df.div(df.groupby([pd.Grouper(freq=agg)]).sum(), axis=1)
relative_df = absolute_to_relative_agg(df, 'M')
For the sums, you can groupby the month of the index:
In [31]: month_sum = df.groupby(df.index.strftime('%Y%m')).sum()
...: month_sum
...:
Out[31]:
value_in_question
201901 1386
201902 1440
201903 1358
Then you can use .map to align the monthly sums with the correct rows of the dataframe:
In [32]: map_sum = df.index.strftime('%Y%m').map(month_sum['value_in_question'])
...: map_sum
...:
Out[32]:
Int64Index([1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386,
1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386,
1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1440, 1440,
1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440,
1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440,
1440, 1440, 1440, 1440, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
1358, 1358],
dtype='int64')
Then you just need to do the division:
In [33]: df['value_in_question'].div(map_sum)
Out[33]:
date
2019-01-01 0.012987
2019-01-02 0.018759
2019-01-03 0.000000
2019-01-04 0.056277
2019-01-05 0.019481
...
2019-03-27 0.031664
2019-03-28 0.007364
2019-03-29 0.050074
2019-03-30 0.033873
2019-03-31 0.005155
Name: value_in_question, Length: 90, dtype: float64
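The three steps above (sum per month, map the sums back onto the rows, divide) could be wrapped into one helper. This is only a minimal sketch: the name relative_to_month_sum and the demo data are assumptions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def relative_to_month_sum(series):
    """Divide each value by the sum of its calendar month (hypothetical helper)."""
    month_key = series.index.strftime("%Y%m")    # one label per row, e.g. '201901'
    month_sum = series.groupby(month_key).sum()  # one total per month
    aligned = month_key.map(month_sum)           # the matching total for every row
    # Convert to a plain array so the division is positional, not label-based.
    return series.div(np.asarray(aligned))


# Deterministic demo data (assumed for illustration): the values 1..90.
dates = pd.date_range("2019-01-01", "2019-03-31", freq="D")
demo = pd.Series(np.arange(1, len(dates) + 1), index=dates, name="value_in_question")

relative = relative_to_month_sum(demo)
# Each month's shares should add up to 1.0, up to floating-point rounding.
print(relative.groupby(relative.index.month).sum())
```

The result keeps the original DatetimeIndex, so it can be assigned back as a new column if needed.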
Aggregate with transform, which returns a Series/DataFrame with the same DatetimeIndex as the original data, so that the division is possible:
def absolute_to_relative_agg(df, agg):
    """
    set_index before using
    """
    return df.div(df.groupby([pd.Grouper(freq=agg)]).transform('sum'))
relative_df = absolute_to_relative_agg(df, 'M')
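The following sketch illustrates a likely reason why the attempt in the question produced NaN while the transform version does not. Division between two DataFrames aligns on the index: .sum() yields one row per month (labelled with the month end), whereas .transform('sum') yields one row per original date. The demo data is assumed, and a pd.offsets.MonthEnd() object is used in place of the 'M' string only because that alias has, as far as I know, been deprecated in favor of 'ME' in recent pandas releases.

```python
import numpy as np
import pandas as pd

# Deterministic demo data (assumed for illustration): the values 1..90.
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
df = pd.DataFrame({"value_in_question": np.arange(1, len(idx) + 1)}, index=idx)

# Aggregation: one row per month, labelled with the month-end date.
monthly_sum = df.groupby(pd.Grouper(freq=pd.offsets.MonthEnd())).sum()

# Transform: the monthly sum repeated on every original row.
broadcast_sum = df.groupby(pd.Grouper(freq=pd.offsets.MonthEnd())).transform("sum")

# Index alignment leaves NaN wherever the date is not a month-end label.
naive = df.div(monthly_sum)
# Same index on both sides, so every row gets a value.
relative = df.div(broadcast_sum)

print(len(monthly_sum), len(broadcast_sum))
print(naive["value_in_question"].notna().sum(), "non-NaN rows in the naive division")
```

In this setup only the rows that happen to fall on a month-end date survive the naive division, which is consistent with head() showing nothing but NaN.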
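Use pd.Grouper with freq='M'.
The code is:
relative_df = df.groupby(pd.Grouper(freq='M'), group_keys=False)\
    .value_in_question.apply(lambda x: x.div(x.sum()).mul(100))
It returns a Series with the same index as the original DataFrame, whose values are percentages of the sum of the respective month. Passing group_keys=False keeps the month label out of the result index on newer pandas versions.
Another way to call the function absolute_to_relative_agg is with pipe:
relative_df = df.pipe(absolute_to_relative_agg, 'M')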
print (relative_df)
value_in_question
date
2019-01-01 0.032901
2019-01-02 0.045862
2019-01-03 0.048853
2019-01-04 0.008475
2019-01-05 0.041376
...
2019-03-27 0.062049
2019-03-28 0.002165
2019-03-29 0.048341
2019-03-30 0.007937
2019-03-31 0.015152
[90 rows x 1 columns]
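Since the title mentions multiple time series, here is a hedged sketch suggesting that the transform approach carries over unchanged to a DataFrame with several columns. The column names series_a and series_b and the data are invented for illustration, and the months are grouped via index.to_period("M") rather than pd.Grouper, which is a swapped-in but equivalent way of keying rows by month.

```python
import numpy as np
import pandas as pd

# Two hypothetical series sharing one daily DatetimeIndex.
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
multi = pd.DataFrame(
    {"series_a": np.ones(len(idx)), "series_b": np.arange(1.0, len(idx) + 1)},
    index=idx,
)

month = multi.index.to_period("M")                     # month label for every row
monthly_total = multi.groupby(month).transform("sum")  # per-column monthly sums, one row per date
share = multi.div(monthly_total).mul(100)              # percentage of the month, per column

# Every column should add up to 100 within each month, up to rounding.
print(share.groupby(month).sum())
```

Each column is normalized independently, so the result has the same shape and index as the input.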