Weighted average of rows in a pandas DataFrame
I want to do the following more efficiently. For data collected by "name", "date", and "time", with an additional indicator variable "id", I want to compute a daily weighted average of the "value" column per "id", using the "weights" column as the weights in the average. A sample of the raw data looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/24/2014", "06/24/2014",
             "06/25/2014", "06/25/2014", "06/25/2014", "06/24/2014",
             "06/24/2014", "06/25/2014"],
    "time": ["13:01:08", "13:46:53", "13:47:13", "13:49:11", "13:51:09",
             "14:35:03", "15:35:00", "16:17:26", "16:17:26", "16:17:26"],
    "id": ["B", "B", "S", "S", "S", "B", "S", "B", "S", "S"],
    "value": [100.0, 98.0, 102.0, 80.0, 10.0, 200.0, 99.5, 10.0, 9.8, 10.0],
    "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0, 3960000.0,
                5000000.0, 2000000.0, 6850.0, 162997.8, 5000.0],
})
After applying this function, the data should contain only the "name", "id", and "w_avg" columns. I wrote the following for this using groupby:
df1 = df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x['value'], weights=x['weights'])).unstack()
The output I get from this looks like:
id B S
name date
A 06/24/2014 99.680581 91.006949
06/25/2014 200.000000 10.000000
B 06/24/2014 10.000000 9.800000
06/25/2014 NaN 99.276808
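As a sanity check on the table above, each cell is just a weighted average over the matching rows; here is a minimal manual verification (my own, not from the original post) of the (name="A", date="06/24/2014", id="B") cell, which combines the first two rows of the sample data:

```python
import numpy as np

# The (name="A", date="06/24/2014", id="B") group covers the first two rows.
values = np.array([100.0, 98.0])
weights = np.array([20835000.0, 3960000.0])

# np.average computes (values * weights).sum() / weights.sum()
w_avg = np.average(values, weights=weights)
print(round(w_avg, 6))  # 99.680581, matching the table
```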
Now, for each "name" and "date", I want to subtract the id "S" values from the id "B" values to get a "diff" column. To do this, I created a new DataFrame. To extract the index, I did the following:
name,date = zip(*list(df1.index.values))
df2 = pd.DataFrame({'name':name, 'date':date, 'B':list(df1['B']), 'S':list(df1['S'])})
df2['diff'] = df2['B'] - df2['S']
Can you suggest a more compact way of doing this? I would also like it to be fast, since I am working with millions of rows. Is groupby the best approach here? Thank you.
I think you can use reset_index and then subtract:
df3 = df1.reset_index()
df3['diff'] = df3['B'] - df3['S']
print (df3)
id name date B S diff
0 A 06/24/2014 99.680581 91.006949 8.673632
1 A 06/25/2014 200.000000 10.000000 190.000000
2 B 06/24/2014 10.000000 9.800000 0.200000
3 B 06/25/2014 NaN 99.276808 NaN
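Put together, a self-contained sketch of this answer (assuming pandas and numpy are installed) that rebuilds the sample data without the unused "time" column, computes the weighted averages, and adds the diff column:

```python
import numpy as np
import pandas as pd

# Sample data from the question ("time" is not needed for the average).
df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "date": ["06/24/2014", "06/24/2014", "06/24/2014", "06/24/2014",
             "06/25/2014", "06/25/2014", "06/25/2014", "06/24/2014",
             "06/24/2014", "06/25/2014"],
    "id": ["B", "B", "S", "S", "S", "B", "S", "B", "S", "S"],
    "value": [100.0, 98.0, 102.0, 80.0, 10.0, 200.0, 99.5, 10.0, 9.8, 10.0],
    "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0, 3960000.0,
                5000000.0, 2000000.0, 6850.0, 162997.8, 5000.0],
})

# Weighted average per (name, date, id); unstack pivots "id" into columns.
df1 = (df.groupby(["name", "date", "id"])
         .apply(lambda x: np.average(x["value"], weights=x["weights"]))
         .unstack())

# reset_index turns the MultiIndex back into ordinary columns.
df3 = df1.reset_index()
df3["diff"] = df3["B"] - df3["S"]
print(df3)
```

The first row's diff is 99.680581 - 91.006949 = 8.673632, matching the output above; the last row's diff is NaN because there is no "B" observation for ("B", "06/25/2014").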
EDIT:

Your solution seems to be the fastest for len(df) = 100k:
df = pd.concat([df]*10000).reset_index(drop=True)
In [114]: %timeit (df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x.value, weights=x.weights)))
10 loops, best of 3: 34.6 ms per loop
In [115]: %timeit ((df.value * df.weights).groupby([df.name,df.date,df.id]).sum() / df.weights.groupby([df.name,df.date,df.id]).sum())
10 loops, best of 3: 38.4 ms per loop
But the fastest solution is:
df['value'] = df.value * df.weights
g = df.groupby(['name','date','id'])
print (g['value'].sum() / g['weights'].sum())
In [125]: %timeit (a(df))
10 loops, best of 3: 20 ms per loop
Test code:
def a(df):
df['value'] = df.value * df.weights
g = df.groupby(['name','date','id'])
return (g['value'].sum() / g['weights'].sum())
print (a(df))
df = pd.concat([df]*10000).reset_index(drop=True)
df5 = df.copy()
def orig(df):
df1 = df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x['value'], weights=x['weights'])).unstack()
name,date = zip(*list(df1.index.values))
df2 = pd.DataFrame({'name':name, 'date':date, 'B':list(df1['B']), 'S':list(df1['S'])})
df2['diff'] = df2['B'] - df2['S']
df2 = df2[['name','date','B','S','diff']]
return df2
def a(df):
df['value'] = df.value * df.weights
g = df.groupby(['name','date','id'])
df2 = (g['value'].sum() / g['weights'].sum()).unstack().reset_index()
df2['diff'] = df2['B'] - df2['S']
return df2
print (orig(df5))
print (a(df))
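One caveat with a(df) above: it overwrites the "value" column in place, so the input DataFrame is mutated. A non-mutating variant might look like the following sketch (the helper name weighted_avg_diff is mine, not from the original answer):

```python
import pandas as pd

def weighted_avg_diff(df):
    # Non-mutating variant: put the value*weights product in a temporary
    # column via assign() instead of overwriting df['value'] in place.
    tmp = df.assign(wv=df["value"] * df["weights"])
    g = tmp.groupby(["name", "date", "id"])
    out = (g["wv"].sum() / g["weights"].sum()).unstack().reset_index()
    out["diff"] = out["B"] - out["S"]
    return out

# Minimal check that the input frame is left untouched.
sample = pd.DataFrame({
    "name": ["A", "A", "A", "A"],
    "date": ["06/24/2014"] * 4,
    "id": ["B", "B", "S", "S"],
    "value": [100.0, 98.0, 102.0, 80.0],
    "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0],
})
res = weighted_avg_diff(sample)
print(res)
assert "wv" not in sample.columns and sample["value"].iloc[0] == 100.0
```

This keeps the original frame intact at the cost of one extra column allocation, which may matter on millions of rows.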
EDIT 1:

Comparing the solution with the original one:
In [132]: %timeit (orig(df5))
10 loops, best of 3: 37.4 ms per loop
In [133]: %timeit (a(df))
10 loops, best of 3: 22.7 ms per loop
Comments:

What's wrong with df1['diff'] = df1['B'] - df1['S']? – EdChum

Thanks, EdChum. That's a nice idea, but I would like to end up with a "normal" DataFrame afterwards so I can do further operations on it.

Why does that matter? Unless you have specific requirements, it will still work.

Thank you, jezrael. Is groupby generally the fastest way to do this kind of thing? As I mentioned in the post, I want to do this on millions of rows, which could take more than 10 minutes. Also, does resetting the index turn the groupby result into a "normal" DataFrame?

Give me a moment, I am trying some tests. reset_index creates new columns from the MultiIndex and creates a nice index: range(len(df)).

I tried your "fastest solution" on 14 million rows and it is indeed much faster than what I suggested, thanks.