How to improve the performance of an average calculation in a Python DataFrame

python python-3.x pandas


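I am trying to improve the performance of my current piece of code, which loops through one DataFrame (DataFrame 'r') and, based on conditions, finds the average of values from another DataFrame (DataFrame 'p').

I want to find the average of all values (column 'Val') from DataFrame 'p' where (r.RefDate == p.RefDate) & (r.Item == p.Item) & (p.StartDate >= r.PeriodStartDate) & (p.EndDate <= r.PeriodEndDate).

By using iterrows I managed to improve performance, although there may be quicker ways: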

for index, row in r.iterrows():      
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) & 
                         (p['EndDate'] <= row.PeriodEndDate) &
                         (p['RefDate'] == row.RefDate) &
                         (p['Item'] == row.Item))].mean()

    r.loc[index, 'AvgVal'] = avg_price

The first change is in how the r DataFrame is generated: both PeriodStartDate and PeriodEndDate are created as datetime. See the following fragment of your initialization code, changed by me:
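import pandas as pd

# rng and item come from the question's setup code, which is not reproduced here
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
    'PeriodStartDate': pd.to_datetime('2019-10-25'),
    'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})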
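
A small sketch of why the datetime conversion matters: a column built from a plain string is not a datetime column, while one built with pd.to_datetime is. The values of rng and item below are hypothetical stand-ins, since the question's setup code is not shown.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical stand-ins for the question's rng and item.
rng = pd.date_range('2019-10-01', periods=3)
item = ['A', 'B', 'A']

# Period bounds left as plain strings.
r_str = pd.DataFrame({'RefDate': rng, 'Item': item,
                      'PeriodStartDate': '2019-10-25',
                      'PeriodEndDate': '2019-10-31'})

# Period bounds converted with pd.to_datetime, as in the fragment above.
r_dt = pd.DataFrame({'RefDate': rng, 'Item': item,
                     'PeriodStartDate': pd.to_datetime('2019-10-25'),
                     'PeriodEndDate': pd.to_datetime('2019-10-31')})

str_is_datetime = is_datetime64_any_dtype(r_str['PeriodStartDate'])
dt_is_datetime = is_datetime64_any_dtype(r_dt['PeriodStartDate'])
print(str_is_datetime, dt_is_datetime)
```

With datetime columns on both sides, the comparisons against p.StartDate and p.EndDate are date comparisons rather than string comparisons.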

To get better speed, I set the index in both DataFrames to RefDate and Item (the two columns compared for equality) and sorted by index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
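
To illustrate what the sorted index buys, here is a minimal sketch on hypothetical toy data (the question's real p is not shown). After set_index and sort_index, all rows for one (RefDate, Item) pair can be fetched with a single .loc lookup instead of two boolean column comparisons.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's p frame.
p_demo = pd.DataFrame({
    'RefDate': pd.to_datetime(['2019-10-02', '2019-10-01',
                               '2019-10-01', '2019-10-01']),
    'Item': ['A', 'B', 'A', 'A'],
    'Val': [99.0, 5.0, 10.0, 20.0],
})

# Same two steps as above, written without inplace=True.
p_demo = p_demo.set_index(['RefDate', 'Item']).sort_index()

# One lookup returns every row for this (RefDate, Item) pair.
key = (pd.Timestamp('2019-10-01'), 'A')
subset = p_demo.loc[key]

is_sorted = p_demo.index.is_monotonic_increasing
values = sorted(subset['Val'].tolist())
print(is_sorted, values)
```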

This way, access by index is significantly quicker.

Then I defined the following function, which computes the mean of the rows from p "related to" the current row of r:
def myMean(row):
    # row.name is the (RefDate, Item) index key of the current row of r
    pp = p.loc[row.name]
    # keep only the rows of p that fall inside the row's period
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
        pp.EndDate.le(row.PeriodEndDate)].Val.mean()

The only thing left to do is to apply this function (to each row in r) and save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)

Using %timeit, I compared the execution time of the code proposed by EdH with mine. The result is that my code's execution time is almost 10 times shorter. Check for yourself.
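
For anyone who wants to check that the two approaches agree before timing them, the following is a minimal sketch. The toy frames and the helper names loop_average and indexed_average are hypothetical. The data is chosen so that every (RefDate, Item) key of r matches several rows in p, so that the .loc lookup returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data; the question's real frames are not shown.
p_toy = pd.DataFrame({
    'RefDate': pd.to_datetime(['2019-10-01'] * 5 + ['2019-10-02'] * 2),
    'Item': ['A', 'A', 'A', 'B', 'B', 'A', 'A'],
    'StartDate': pd.to_datetime(['2019-10-25', '2019-10-26', '2019-10-20',
                                 '2019-10-25', '2019-10-27',
                                 '2019-10-25', '2019-10-28']),
    'EndDate': pd.to_datetime(['2019-10-30', '2019-10-31', '2019-10-31',
                               '2019-10-31', '2019-11-02',
                               '2019-10-28', '2019-10-31']),
    'Val': [10.0, 20.0, 99.0, 5.0, 7.0, 1.0, 3.0],
})
r_toy = pd.DataFrame({
    'RefDate': pd.to_datetime(['2019-10-01', '2019-10-01', '2019-10-02']),
    'Item': ['A', 'B', 'A'],
    'PeriodStartDate': pd.to_datetime('2019-10-25'),
    'PeriodEndDate': pd.to_datetime('2019-10-31'),
})


def loop_average(r, p):
    """Row-by-row boolean masks, as in the question's iterrows loop."""
    out = []
    for _, row in r.iterrows():
        mask = ((p['StartDate'] >= row.PeriodStartDate) &
                (p['EndDate'] <= row.PeriodEndDate) &
                (p['RefDate'] == row.RefDate) &
                (p['Item'] == row.Item))
        out.append(p.loc[mask, 'Val'].mean())
    return out


def indexed_average(r, p):
    """Index lookup plus date filter, following the answer's approach."""
    pi = p.set_index(['RefDate', 'Item']).sort_index()
    ri = r.set_index(['RefDate', 'Item']).sort_index()

    def my_mean(row):
        pp = pi.loc[row.name]  # rows of p sharing this (RefDate, Item) key
        inside = (pp.StartDate.ge(row.PeriodStartDate) &
                  pp.EndDate.le(row.PeriodEndDate))
        return pp[inside].Val.mean()

    return ri.apply(my_mean, axis=1).tolist()


loop_result = loop_average(r_toy, p_toy)
indexed_result = indexed_average(r_toy, p_toy)
print(loop_result, indexed_result)
```

The toy frames are far too small for meaningful timings. To reproduce the speed comparison, run both functions under %timeit on data of realistic size.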