Weighted average of DataFrame rows in Python


I would like to do the following more efficiently:

For data collected by "name", "date", "time" and an additional indicator variable "id", I want to compute a daily weighted average of the "value" column per "id", using the "weights" column as the weights in the average. A sample of the raw data looks like this:

df = pd.DataFrame({"name":["A", "A", "A" ,"A", "A" ,"A", "B", "B", "B", "B"], "date":["06/24/2014","06/24/2014","06/24/2014","06/24/2014","06/25/2014","06/25/2014","06/25/2014","06/24/2014","06/24/2014","06/25/2014"], "time":['13:01:08', '13:46:53', '13:47:13', '13:49:11', '13:51:09', '14:35:03','15:35:00', '16:17:26', '16:17:26', '16:17:26'] , "id": ["B","B","S","S","S","B","S","B","S","S"], "value":[100.0, 98.0, 102.0, 80.0, 10.0, 200.0, 99.5, 10.0, 9.8, 10.0], "weights": [20835000.0, 3960000.0, 3960000.0, 3955000.0, 3960000.0, 5000000.0, 2000000.0, 6850.0, 162997.79999999999, 5000.0] })
After applying this function, the data should contain only the "name", "date", "id" and "w_avg" columns.

I wrote the following for this using groupby:

df1 = (df.groupby(['name','date','id'], as_index=False)
         .apply(lambda x: np.average(x['value'], weights=x['weights']))
         .unstack())
The output I get from this is:

id                        B          S
name date                             
A    06/24/2014   99.680581  91.006949
     06/25/2014  200.000000  10.000000
B    06/24/2014   10.000000   9.800000
     06/25/2014         NaN  99.276808
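As a sanity check (my own arithmetic, not part of the question), the B cell for name A on 06/24/2014 combines the two matching input rows with values 100 and 98:

# weighted average of the two (A, 06/24/2014, B) rows
(100.0 * 20835000.0 + 98.0 * 3960000.0) / (20835000.0 + 3960000.0)   # -> 99.680581...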
Now, for each "name" and "date", I want to subtract the id "S" entry from the id "B" entry to get a "diff" column.

To do this, I created a new DataFrame. To extract the index, I did the following:

name,date = zip(*list(df1.index.values))

df2 = pd.DataFrame({'name':name, 'date':date, 'B':list(df1['B']), 'S':list(df1['S'])})
df2['diff'] = df2['B'] - df2['S']
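As an aside (my addition, not from the question), the two index levels can also be extracted without zip, using Index.get_level_values, assuming the levels keep their names as in the printed output above:

name = df1.index.get_level_values('name')
date = df1.index.get_level_values('date')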
Can you suggest a more compact way of doing this? Also, I would like it to run fast, since I am working with millions of rows. Is groupby the best approach?

Thank you,

I think you can use reset_index, and then subtract:

df3 = df1.reset_index()

df3['diff'] = df3['B'] - df3['S']
print (df3)

id name        date           B          S        diff
0     A  06/24/2014   99.680581  91.006949    8.673632
1     A  06/25/2014  200.000000  10.000000  190.000000
2     B  06/24/2014   10.000000   9.800000    0.200000
3     B  06/25/2014         NaN  99.276808         NaN
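The subtraction can just as well be done before resetting the index, since the B and S columns already line up on the MultiIndexed frame (one of the comments below makes the same point):

df1['diff'] = df1['B'] - df1['S']
df3 = df1.reset_index()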
EDIT:

Your solution seems to be the fastest with len(df) = 100k:

df = pd.concat([df]*10000).reset_index(drop=True)

In [114]: %timeit (df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x.value, weights=x.weights)))
10 loops, best of 3: 34.6 ms per loop

In [115]: %timeit ((df.value * df.weights).groupby([df.name,df.date,df.id]).sum() /  df.weights.groupby([df.name,df.date,df.id]).sum())
10 loops, best of 3: 38.4 ms per loop    
But the fastest solution is to skip the per-group Python lambda and use two vectorized groupby sums instead:

df['value'] = df.value * df.weights
g = df.groupby(['name','date','id']) 
print (g['value'].sum() / g['weights'].sum())

In [125]: %timeit (a(df))
10 loops, best of 3: 20 ms per loop
Test code:

# weighted average per ('name', 'date', 'id'); note that this
# overwrites df['value'] in place
def a(df):
    df['value'] = df.value * df.weights
    g = df.groupby(['name','date','id']) 
    return (g['value'].sum() / g['weights'].sum())

print (a(df))   
df = pd.concat([df]*10000).reset_index(drop=True)
df5 = df.copy()

def orig(df):

    df1 = df.groupby(['name','date','id'], as_index=False).apply(lambda x: np.average(x['value'], weights=x['weights'])).unstack()   
    name,date = zip(*list(df1.index.values))

    df2 = pd.DataFrame({'name':name, 'date':date, 'B':list(df1['B']), 'S':list(df1['S'])})
    df2['diff'] = df2['B'] - df2['S']
    df2 = df2[['name','date','B','S','diff']]
    return df2

# a() redefined to also unstack 'id' and add the 'diff' column
def a(df):
    df['value'] = df.value * df.weights
    g = df.groupby(['name','date','id']) 
    df2 = (g['value'].sum() / g['weights'].sum()).unstack().reset_index()
    df2['diff'] = df2['B'] - df2['S']
    return df2    

print (orig(df5))    
print (a(df))  
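One caveat worth flagging (my observation, not part of the answer): a(df) overwrites the 'value' column in place, which is why the test code keeps the separate copy df5 for orig. A non-destructive variant with the same arithmetic could put the product in a temporary column; a_copy is a hypothetical name:

def a_copy(df):
    # same weighted-average arithmetic, but on a temporary column 'wv',
    # so the caller's 'value' column is left untouched
    tmp = df.assign(wv=df['value'] * df['weights'])
    g = tmp.groupby(['name', 'date', 'id'])
    df2 = (g['wv'].sum() / g['weights'].sum()).unstack().reset_index()
    df2['diff'] = df2['B'] - df2['S']
    return df2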
EDIT 1:

Comparing the solution with the original one (same test code as above):

In [132]: %timeit (orig(df5))
10 loops, best of 3: 37.4 ms per loop

In [133]: %timeit (a(df))
10 loops, best of 3: 22.7 ms per loop

Comments:

What's wrong with df1['diff'] = df1['B'] - df1['S']?

Thanks EdChum. Your idea is good, but I would like to end up with a "normal" DataFrame afterwards, so that I can do further operations on it.

Why does that matter? It will still work unless you have some specific requirement.

Thank you, jezrael. Is groupby generally the fastest way to do this kind of thing? As I mentioned in the post, I want to do this on millions of rows, and it can take more than 10 minutes. Also, does resetting the index turn the groupby result into a "normal" DataFrame?

Give me a moment, I'm trying some tests. reset_index creates new columns from the MultiIndex and creates a nice index - range(len(df)).

I tried your "fastest solution" on 14 million rows and it is indeed much faster than what I suggested, thanks.
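A minimal illustration of that last point about reset_index (my sketch, reusing df1 from the question; the level names follow the printed output above):

print(df1.index)         # MultiIndex with levels 'name' and 'date'
df3 = df1.reset_index()  # the index levels become ordinary columns
print(df3.index)         # RangeIndex(start=0, stop=4, step=1)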