Python: apply() on a groupby() object runs many more times than the number of groups
I inherited some pandas code that I am trying to optimize. One DataFrame, results, is created with:
results = pd.DataFrame(columns=['plan', 'volume', 'avg_denial_increase', 'std_dev_impact',
                                'avg_idr_increase', 'std_dev_idr_increase'])
for plan in my_df['plan_name'].unique():
    # .copy() so the column assignments below do not write into a view of my_df
    df1 = my_df[my_df['plan_name'] == plan].copy()
    df1['volume'] = df1['volume'].fillna(0)
    df1['change'] = df1['idr'] - df1['idr'].shift(1)
    df1['change'] = df1['change'].fillna(0)
    df1['impact'] = df1['change'] * df1['volume']
    describe_impact = df1['impact'].describe()
    describe_change = df1['change'].describe()
    # DataFrame.append was removed in pandas 2.0; this call needs an older pandas
    results = results.append({'plan': plan,
                              'volume': df1['volume'].mean(),
                              'avg_denial_increase': describe_impact['mean'],
                              'std_dev_impact': describe_impact['std'],
                              'avg_idr_increase': describe_change['mean'],
                              'std_dev_idr_increase': describe_change['std']},
                             ignore_index=True)
My first thought was to move everything under the for loop into a separate function, get_results_for_plan, and use the pandas groupby() and apply() methods. But that turned out to be even slower. Running
%lprun -f get_results_for_plan my_df.groupby('plan_name', sort=False, as_index=False).apply(get_results_for_plan)
returns
Timer unit: 1e-06 s
Total time: 0.77167 s
File: <ipython-input-46-7c36b3902812>
Function: get_results_for_plan at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def get_results_for_plan(plan_df):
     2        94      33221.0    353.4      4.3      plan = plan_df.iloc[0]['plan_name']
     3        94      25901.0    275.5      3.4      plan_df['volume'].fillna(0, inplace=True)
     4        94      75765.0    806.0      9.8      plan_df['change'] = plan_df['idr'] - plan_df['idr'].shift(1)
     5        93      38653.0    415.6      5.0      plan_df['change'].fillna(0, inplace=True)
     6        93      57088.0    613.8      7.4      plan_df['impact'] = plan_df['change'] * plan_df['volume']
     7        93     204828.0   2202.5     26.5      describe_impact = plan_df['impact'].describe()
     8        93     201127.0   2162.7     26.1      describe_change = plan_df['change'].describe()
     9        93        129.0      1.4      0.0      return pd.DataFrame({'plan': plan,
    10        93      21703.0    233.4      2.8                           'volume': plan_df['volume'].mean(),
    11        93       4291.0     46.1      0.6                           'avg_denial_increase': describe_impact['mean'],
    12        93       1957.0     21.0      0.3                           'std_dev_impact': describe_impact['std'],
    13        93       2912.0     31.3      0.4                           'avg_idr_increase': describe_change['mean'],
    14        93       1783.0     19.2      0.2                           'std_dev_idr_increase': describe_change['std']},
    15        93     102312.0   1100.1     13.3                          index=[0])
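There are 72 groups. So why is each of these lines hit 94 or 93 times? (This may be related to a known issue, but in that case I would expect the hit count to be num_groups + 1.)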
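Update: in the %lprun call above, removing sort=False from the groupby() call reduces the hit count to 80 for lines 2-6 and to 79 for the remaining lines. That is still more than I would expect, but a little better.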
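One way to check the call count independently of the profiler is to record every invocation and compare the total with the number of groups. The following is a minimal sketch on made-up data: `demo`, `calls`, and `record_call` are hypothetical names, not part of the original code. Older pandas releases documented that apply() evaluates the function an extra time on the first group to choose a code path, so the count can legitimately exceed the number of groups, depending on the pandas version.

```python
import pandas as pd

# Hypothetical toy data standing in for my_df
demo = pd.DataFrame({
    'plan_name': ['a', 'a', 'b', 'c', 'c', 'c'],
    'idr': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

calls = []  # one entry per invocation of record_call


def record_call(plan_df):
    """Remember the size of each group passed in, then return a scalar."""
    calls.append(len(plan_df))
    return plan_df['idr'].mean()


# Select only 'idr' so the grouping column is not passed to the function
result = demo.groupby('plan_name', sort=False)[['idr']].apply(record_call)

n_groups = demo['plan_name'].nunique()
print(f"groups: {n_groups}, function calls: {len(calls)}")
```

If the two numbers differ on the real data as well, the extra calls come from pandas itself rather than from the profiler.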
Second question: is there a better way to optimize this code?

Here is roughly what I meant in my comment:
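import numpy as np
import pandas as pd

def append_to_list():
    # Collect rows in a plain list and build the DataFrame once at the end
    l = []
    for _ in range(10000):
        l.append(np.random.random(4))
    return pd.DataFrame(l, columns=list('abcd'))

def append_to_df():
    # Grow the DataFrame row by row; every append copies the whole frame.
    # DataFrame.append was removed in pandas 2.0, so this needs an older pandas.
    cols = list('abcd')
    df = pd.DataFrame(columns=cols)
    for _ in range(10000):
        df = df.append({k: v for k, v in zip(cols, np.random.random(4))},
                       ignore_index=True)
    return df

%timeit append_to_list()
# 31.5 ms ± 925 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit append_to_df()
# 9.05 s ± 337 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)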
So the biggest gain for your code would probably come from:
results = []
for plan in my_df['plan_name'].unique():
    # .copy() so the column assignments below do not write into a view of my_df
    df1 = my_df[my_df['plan_name'] == plan].copy()
    df1['volume'] = df1['volume'].fillna(0)
    df1['change'] = df1['idr'] - df1['idr'].shift(1)
    df1['change'] = df1['change'].fillna(0)
    df1['impact'] = df1['change'] * df1['volume']
    describe_impact = df1['impact'].describe()
    describe_change = df1['change'].describe()
    # Append a plain tuple; the DataFrame is built once after the loop
    results.append((plan,
                    df1['volume'].mean(),
                    describe_impact['mean'],
                    describe_impact['std'],
                    describe_change['mean'],
                    describe_change['std']))
results = pd.DataFrame(results, columns=['plan', 'volume', 'avg_denial_increase', 'std_dev_impact',
                                         'avg_idr_increase', 'std_dev_idr_increase'])
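Comments:

A small improvement might come from filling 'change' and 'volume' at the same time. The same applies to describe. Applying the method to both columns at once would reduce the number of calls and, I would guess, the run time.

@Gio I tried that, and both combinations resulted in a surprising increase in run time!

I think a major bottleneck is appending to the DataFrame. You could try appending to a list and converting the list to a DataFrame when you are done.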
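In the profiler output, the two describe() calls account for roughly half of the time in the function, yet only the mean and std entries are read from each result. A possible variation on the "both columns at once" idea is to compute just those two statistics in a single call. This is a sketch rather than a measured improvement; `mean_and_std` is a hypothetical helper name, and whether it is faster on the real data would need to be timed.

```python
import pandas as pd


def mean_and_std(df1):
    """Return mean and sample std of 'impact' and 'change' in one call.

    The result is a DataFrame indexed by ['mean', 'std'] with one column
    per input column. Like describe(), 'std' uses ddof=1.
    """
    return df1[['impact', 'change']].agg(['mean', 'std'])


# Usage on hypothetical toy data
demo = pd.DataFrame({'impact': [0.0, 2.0, 4.0], 'change': [0.0, 1.0, 2.0]})
stats = mean_and_std(demo)
print(stats.loc['mean', 'impact'], stats.loc['std', 'change'])
```

The values can then be read with stats.loc['mean', 'impact'] and so on, in place of describe_impact['mean'].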
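Going one step further than the list-based loop, the per-plan Python loop can be avoided entirely with grouped, vectorized operations. The sketch below is not from the original answer: `summarize_plans` is a hypothetical name, and it assumes my_df has the columns plan_name, volume, and idr, with rows already in the order in which the per-plan differences should be taken. Rows whose plan_name is missing are dropped by groupby(), which differs slightly from iterating over unique().

```python
import numpy as np
import pandas as pd


def summarize_plans(my_df):
    """Per-plan summary without an explicit Python loop (illustrative sketch)."""
    df = my_df[['plan_name', 'volume', 'idr']].copy()
    df['volume'] = df['volume'].fillna(0)
    # Row-to-row difference within each plan; the first row of every plan
    # (and any row involving a missing idr) becomes 0, as in the loop version
    df['change'] = df.groupby('plan_name', sort=False)['idr'].diff().fillna(0)
    df['impact'] = df['change'] * df['volume']
    # Named aggregation: output column = (input column, statistic)
    return (df.groupby('plan_name', sort=False)
              .agg(volume=('volume', 'mean'),
                   avg_denial_increase=('impact', 'mean'),
                   std_dev_impact=('impact', 'std'),
                   avg_idr_increase=('change', 'mean'),
                   std_dev_idr_increase=('change', 'std'))
              .reset_index()
              .rename(columns={'plan_name': 'plan'}))


# Usage on hypothetical toy data
demo = pd.DataFrame({
    'plan_name': ['a', 'a', 'b', 'b', 'b'],
    'volume': [1.0, np.nan, 2.0, 2.0, 4.0],
    'idr': [0.1, 0.3, 0.5, 0.4, 0.8],
})
summary = summarize_plans(demo)
print(summary)
```

The design choice here is to compute change and impact once for the whole frame and let a single groupby().agg() call produce all five statistics, so pandas calls no user-defined function per group.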