Python: What is the most efficient way to iteratively compute aggregates/statistics for each row of a DataFrame using only the preceding rows?
I'm working on an analysis that requires me to compute summary statistics and bootstrapped confidence intervals for each record in a 12-million-record dataset. The statistics for a given record must be based only on the records that share its location_id, hour, and day column values, and whose creation timestamp (held in the two_hour_buckets_x column) is earlier than the timestamp of the record whose statistics are being computed. I've tried several different approaches, but I'm struggling to find one that returns the required calculations in a reasonable amount of time. Any suggestions for a more efficient approach would be greatly appreciated. Here are the bootstrap functions I wrote:
import numpy as np

def bstrp_std(data):
    """95th-percentile-highest bootstrapped std estimate."""
    n = len(data)
    # draw 1000 bootstrap resamples, each the size of the original sample
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.std(boot, axis=1))
    return stat[950]  # index 950 of 1000 sorted values ~ the 95th percentile

def bstrp_avg(data):
    """95th-percentile-highest bootstrapped mean estimate."""
    n = len(data)
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.mean(boot, axis=1))
    return stat[950]
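A small aside (my own sketch, not part of the original post): both helpers draw an identical (1000, n) resample matrix, so wherever both statistics are needed for the same slice of data you could draw once and compute both, roughly halving the resampling cost. Note that np.percentile interpolates over the sorted values rather than indexing position 950, so results may differ marginally from the originals:

import numpy as np

def bstrp_both(data, n_boot=1000, q=95):
    """Upper q-th percentile bootstrap estimates of (std, mean),
    computed from a single shared resample matrix."""
    data = np.asarray(data)
    boot = np.random.choice(data, size=(n_boot, len(data)))
    std_est = np.percentile(boot.std(axis=1), q)
    mean_est = np.percentile(boot.mean(axis=1), q)
    return std_est, mean_est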
Methods I've tried:
Method 1: iterating over the DataFrame with .iterrows(). This was far too slow:
df = None
for n, (_, row) in enumerate(clean.iterrows()):
    # all records in the same (location_id, day, hour) group that were
    # created in an earlier two-hour bucket than the current record
    temp = clean[(clean['location_id'] == row['location_id'])
                 & (clean['day'] == row['day'])
                 & (clean['hour'] == row['hour'])
                 & (clean['two_hour_buckets_x'] < row['two_hour_buckets_x'])].copy()
    # named aggregation replaces the deprecated dict-renaming form of .agg
    temp = (temp.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']
                .agg(std_bstp=bstrp_std,
                     mean_bstp=bstrp_avg,
                     mean='mean',
                     std='std',
                     sample_size='size')
                .reset_index())
    temp['two_hour_buckets'] = row['two_hour_buckets_x']
    if n < 1:
        df = temp
    else:
        # append-in-a-loop copies df on every iteration (and DataFrame.append
        # was removed in pandas 2.0); see the list-based fix below
        df = df.append(temp, ignore_index=True)
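As the comments below point out, growing df with append inside the loop is the most expensive part of this approach. A minimal fix (my own sketch, using the same column layout and bootstrap helpers as above) is to collect the small per-row result frames in a list and concatenate them once at the end:

import pandas as pd

results = []
for _, row in clean.iterrows():
    prior = clean[(clean['location_id'] == row['location_id'])
                  & (clean['day'] == row['day'])
                  & (clean['hour'] == row['hour'])
                  & (clean['two_hour_buckets_x'] < row['two_hour_buckets_x'])]
    stats = (prior.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']
                  .agg(std_bstp=bstrp_std, mean_bstp=bstrp_avg,
                       mean='mean', std='std', sample_size='size')
                  .reset_index())
    stats['two_hour_buckets'] = row['two_hour_buckets_x']
    results.append(stats)

df = pd.concat(results, ignore_index=True)  # a single O(total-rows) concatenation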
Input:
two_hour_buckets_x location_id day hour outgoing_payment_amount
2000 1434650400 59 Thursday 10 0.00
2001 1434657600 59 Thursday 12 0.00
2002 1434664800 59 Thursday 14 0.00
2003 1434672000 59 Thursday 16 1017.46
2004 1434679200 59 Thursday 18 0.00
2005 1434686400 59 Thursday 20 0.00
2006 1434693600 59 Thursday 22 0.00
2007 1434700800 59 Friday 0 0.00
2008 1434708000 59 Friday 2 0.00
2009 1434715200 59 Friday 4 0.00
2010 1434722400 59 Friday 6 0.00
2011 1434729600 59 Friday 8 0.00
2012 1434736800 59 Friday 10 0.00
2013 1434744000 59 Friday 12 0.00
2014 1434751200 59 Friday 14 0.00
2015 1434758400 59 Friday 16 528.22
2016 1434765600 59 Friday 18 865.96
2017 1434772800 59 Friday 20 0.00
2018 1434780000 59 Friday 22 0.00
2019 1434787200 59 Saturday 0 0.00
Desired output:
location_id two_hour_buckets_x_x sample_size std_bstp std \
2000 59 1435255200 24 476.922804 350.986069
2001 59 1435262400 24 696.152358 449.504956
2002 59 1435269600 24 487.779153 383.545528
2003 59 1435276800 24 489.020190 401.858948
2004 59 1435284000 24 670.082177 535.158428
2005 59 1435291200 24 297.647022 183.711731
2006 59 1435298400 24 0.000000 0.000000
2007 59 1435305600 24 0.000000 0.000000
2008 59 1435312800 24 0.000000 0.000000
2009 59 1435320000 24 0.000000 0.000000
2010 59 1435327200 24 0.000000 0.000000
2011 59 1435334400 24 115.976509 71.582255
2012 59 1435341600 24 336.998549 251.685526
2013 59 1435348800 24 495.415309 384.295034
2014 59 1435356000 25 276.204290 221.158691
2015 59 1435363200 25 646.605050 465.187672
2016 59 1435370400 25 606.740824 501.532447
2017 59 1435377600 25 207.046545 153.245775
2018 59 1435384800 25 0.000000 0.000000
2019 59 1435392000 25 0.000000 0.000000
mean_bstp mean
2000 276.157500 150.517500
2001 302.775000 142.515000
2002 342.689167 197.455000
2003 382.813333 246.694167
2004 459.903333 290.807500
2005 112.500000 37.500000
2006 0.000000 0.000000
2007 0.000000 0.000000
2008 0.000000 0.000000
2009 0.000000 0.000000
2010 0.000000 0.000000
2011 43.835000 14.611667
2012 183.258333 95.989167
2013 333.573333 192.307500
2014 176.465600 102.017600
2015 411.064000 247.736800
2016 466.547200 290.933600
2017 105.095200 51.756000
2018 0.000000 0.000000
2019 0.000000 0.000000
Short answer: you would do this in a groupby/apply operation. For now, you should post 20 sample records and the expected output for them. For method 1,

df = df.append(temp, ignore_index=True)

is very slow. You would be better off collecting the rows in a list and then appending them all at once. For method 3, what is the timing difference if you substitute dummy bstrp_std / bstrp_avg functions (i.e., ones that just return a constant)? That would let you judge the extra computational burden of those steps, which I suspect is considerable.

Paul, thanks for the answer; I've edited in the input and the expected output. The groupby/apply approach finally gave me results after about 4 hours of processing. Alexander, you're right that the bootstrapping takes a great deal of time, but even without those functions (just np.std and np.mean) it still takes around 20 minutes.
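For the non-bootstrap columns there is a way to avoid scanning the full 12M-row frame once per record (a sketch of my own, not from the thread): sort by timestamp, group once, and take expanding statistics shifted down by one row within each group, so each record sees only strictly earlier records. The bootstrap columns still need per-row resampling, but this removes the Python-level loop for mean, std, and sample size:

import pandas as pd

clean = clean.sort_values('two_hour_buckets_x')
g = clean.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']

# expanding() includes the current row, so shift each expanding series down
# by one within its group to restrict it to strictly earlier records;
# ddof=0 matches np.std as used in the question
prior_mean = g.expanding().mean().groupby(level=[0, 1, 2]).shift(1)
prior_std = g.expanding().std(ddof=0).groupby(level=[0, 1, 2]).shift(1)
prior_n = g.expanding().count().groupby(level=[0, 1, 2]).shift(1)

# the expanding results carry the group keys as extra index levels; drop
# them to align back to clean's row index (assumes clean's index is unique)
clean['mean'] = prior_mean.droplevel([0, 1, 2])
clean['std'] = prior_std.droplevel([0, 1, 2])
clean['sample_size'] = prior_n.droplevel([0, 1, 2])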