What is the most efficient way in Python to iteratively compute aggregates/statistics for each row of a dataframe using only the prior rows?

I'm working on an analysis that requires me to compute summary statistics and bootstrapped confidence intervals for every record in a 12-million-record dataset, based only on records that share the same location_id, hour, and day column values and whose creation timestamp (here called two_hour_buckets_x) is < the timestamp of the record the statistics are being computed for.

I've tried a few different approaches, but I'm struggling to find one that returns the required computations in a reasonable amount of time. Any suggestions for a more efficient approach would be greatly appreciated.

Here are the bootstrap functions I wrote:

import numpy as np

def bstrp_std(data):
    """95th-percentile highest bootstrapped std estimate."""
    n = len(data)
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.std(boot, axis=1))
    return stat[950]

def bstrp_avg(data):
    """95th-percentile highest bootstrapped mean estimate."""
    n = len(data)
    boot = np.random.choice(data, size=(1000, n))
    stat = np.sort(np.mean(boot, axis=1))
    return stat[950]
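The two functions above differ only in the statistic they resample; as a side note, they could be collapsed into one helper that takes the statistic as a parameter and uses `np.percentile` instead of sorting and indexing. This is a sketch of that idea (the name `bstrp_stat` and its parameters are illustrative, not from the post):

```python
import numpy as np

def bstrp_stat(data, stat_fn=np.mean, n_boot=1000, q=95, seed=None):
    """q-th-percentile bootstrapped estimate of an arbitrary statistic.

    Resamples `data` with replacement n_boot times, computes stat_fn over
    each resample, and returns the q-th percentile of those estimates.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    boot = rng.choice(data, size=(n_boot, data.size), replace=True)
    return np.percentile(stat_fn(boot, axis=1), q)
```

`bstrp_stat(x)` then replaces `bstrp_avg(x)`, and `bstrp_stat(x, np.std)` replaces `bstrp_std(x)`, with a seedable generator for reproducibility.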
Approaches I've tried:

Method 1: Iterating over the dataframe with .iterrows(). Way too slow.

df = None
for row in clean.itertuples():
    # records in the same (location_id, day, hour) group with a strictly earlier timestamp
    temp = clean[(clean['location_id'] == row.location_id)
                 & (clean['day'] == row.day)
                 & (clean['hour'] == row.hour)
                 & (clean['two_hour_buckets_x'] < row.two_hour_buckets_x)]

    temp = (temp.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']
                .agg(std_bstp=bstrp_std,
                     mean_bstp=bstrp_avg,
                     mean='mean',
                     std=lambda x: np.std(x),   # np.std defaults to ddof=0
                     sample_size='size')
                .reset_index())
    temp['two_hour_buckets'] = row.two_hour_buckets_x

    # DataFrame.append re-copies everything accumulated so far on every iteration
    df = temp if df is None else df.append(temp, ignore_index=True)
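For the plain mean/std/sample-size columns (not the bootstrap ones), the per-row filtering above can be avoided entirely: sort by timestamp and use an expanding window within each (location_id, day, hour) group, shifted by one so each row's statistics cover only strictly earlier rows. This is a vectorized sketch of that idea, not the original poster's code; the function name `prior_row_stats` is illustrative:

```python
import numpy as np
import pandas as pd

def prior_row_stats(clean):
    """Per-row mean/std/sample_size over earlier rows in the same group."""
    clean = clean.sort_values('two_hour_buckets_x').copy()
    g = clean.groupby(['location_id', 'day', 'hour'])['outgoing_payment_amount']
    # shift(1) excludes the current row, so each statistic uses prior rows only
    clean['mean'] = g.transform(lambda s: s.expanding().mean().shift(1))
    # ddof=0 matches np.std, which the bootstrap code uses
    clean['std'] = g.transform(lambda s: s.expanding().std(ddof=0).shift(1))
    clean['sample_size'] = g.cumcount()  # number of earlier rows in the group
    return clean
```

Rows with no prior observations get NaN, which makes the "not enough history yet" cases explicit instead of silently dropping them. The bootstrap columns still need per-group work, but this removes the boolean-mask scan of all 12 million rows for every single record.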
Input:

      two_hour_buckets_x  location_id       day  hour  outgoing_payment_amount
2000          1434650400           59  Thursday    10                     0.00
2001          1434657600           59  Thursday    12                     0.00
2002          1434664800           59  Thursday    14                     0.00
2003          1434672000           59  Thursday    16                  1017.46
2004          1434679200           59  Thursday    18                     0.00
2005          1434686400           59  Thursday    20                     0.00
2006          1434693600           59  Thursday    22                     0.00
2007          1434700800           59    Friday     0                     0.00
2008          1434708000           59    Friday     2                     0.00
2009          1434715200           59    Friday     4                     0.00
2010          1434722400           59    Friday     6                     0.00
2011          1434729600           59    Friday     8                     0.00
2012          1434736800           59    Friday    10                     0.00
2013          1434744000           59    Friday    12                     0.00
2014          1434751200           59    Friday    14                     0.00
2015          1434758400           59    Friday    16                   528.22
2016          1434765600           59    Friday    18                   865.96
2017          1434772800           59    Friday    20                     0.00
2018          1434780000           59    Friday    22                     0.00
2019          1434787200           59  Saturday     0                     0.00
Desired output:

      location_id  two_hour_buckets_x_x  sample_size    std_bstp         std  \
2000           59            1435255200           24  476.922804  350.986069   
2001           59            1435262400           24  696.152358  449.504956   
2002           59            1435269600           24  487.779153  383.545528   
2003           59            1435276800           24  489.020190  401.858948   
2004           59            1435284000           24  670.082177  535.158428   
2005           59            1435291200           24  297.647022  183.711731   
2006           59            1435298400           24    0.000000    0.000000   
2007           59            1435305600           24    0.000000    0.000000   
2008           59            1435312800           24    0.000000    0.000000   
2009           59            1435320000           24    0.000000    0.000000   
2010           59            1435327200           24    0.000000    0.000000   
2011           59            1435334400           24  115.976509   71.582255   
2012           59            1435341600           24  336.998549  251.685526   
2013           59            1435348800           24  495.415309  384.295034   
2014           59            1435356000           25  276.204290  221.158691   
2015           59            1435363200           25  646.605050  465.187672   
2016           59            1435370400           25  606.740824  501.532447   
2017           59            1435377600           25  207.046545  153.245775   
2018           59            1435384800           25    0.000000    0.000000   
2019           59            1435392000           25    0.000000    0.000000   

       mean_bstp        mean  
2000  276.157500  150.517500  
2001  302.775000  142.515000  
2002  342.689167  197.455000  
2003  382.813333  246.694167  
2004  459.903333  290.807500  
2005  112.500000   37.500000  
2006    0.000000    0.000000  
2007    0.000000    0.000000  
2008    0.000000    0.000000  
2009    0.000000    0.000000  
2010    0.000000    0.000000  
2011   43.835000   14.611667  
2012  183.258333   95.989167  
2013  333.573333  192.307500  
2014  176.465600  102.017600  
2015  411.064000  247.736800  
2016  466.547200  290.933600  
2017  105.095200   51.756000  
2018    0.000000    0.000000  
2019    0.000000    0.000000

Comments:

Short answer: you'll do this in a groupby/apply operation. For now, you should post ~20 sample records from the dataset along with the expected output.

For method 1,
df = df.append(temp, ignore_index=True)
is very slow. You are better off collecting the rows in a list and then appending them all at once. For method 3, how do the timings differ if you substitute dummy
bstrp_std
/
bstrp_avg
functions (i.e. ones that just return a constant)? That would let you judge the extra computational burden of those steps, which I suspect is considerable.

Paul, thanks for the reply; I've edited in the input and output. The groupby/apply approach finally gave me results after about 4 hours of processing. Alexander, you're right that the bootstrapping takes a large share of the time, but even without those functions (just np.std and np.mean) it still takes around 20 minutes.
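The commenter's list-then-concatenate suggestion can be sketched like this (illustrative code, not from the post): each `DataFrame.append` inside the loop copies the entire accumulated frame, making the loop quadratic, whereas appending to a plain Python list is cheap and a single `pd.concat` at the end copies the data once.

```python
import pandas as pd

pieces = []
for bucket in [1434650400, 1434657600]:      # stand-in for the real per-row loop
    # stand-in for the per-bucket groupby/agg result
    stats = pd.DataFrame({'mean': [0.0], 'std': [0.0]})
    stats['two_hour_buckets'] = bucket
    pieces.append(stats)                     # cheap list append, no copying
df = pd.concat(pieces, ignore_index=True)    # one copy at the very end
```

This also sidesteps `DataFrame.append` entirely, which was deprecated and later removed from pandas in favor of `pd.concat`.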