Python: for loop over the columns of a DataFrame is inefficient

Tags: python, pandas, dataframe, for-loop, if-statement

I have precipitation data for each cell and date (1800 rows and 15k columns):

                          486335  486336  486337
2019-07-03 13:35:54.445       0       2      22
2019-07-04 13:35:54.445       0       1       1
2019-07-05 13:35:54.445      16       8      22
2019-07-06 13:35:54.445       0       0       0
2019-07-07 13:35:54.445       0      11       0

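I want to find the dates on which a certain rainfall amount (>15 mm) is reached, and count the number of days after that event during which the rainfall stays lower.

You can avoid iterating over the rows, because that does not scale well to large DataFrames.

Here is a different approach; I am not sure whether it will be more efficient on your full DataFrame: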

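import numpy as np
import pandas as pd

periods = []
for cell in df.columns:
    sub = pd.DataFrame({'amount': df[cell].values}, index=df.index)
    # NaN: dry day (<= 0.11), 0: moderate rain, 1: event (> 15)
    sub['flag'] = pd.cut(sub['amount'], [0.11, 15, np.inf],
                         labels=[0, 1]).astype(float)
    # give each event its own number, then let the dry days inherit it
    sub.loc[sub.flag > 0, 'flag'] = sub.loc[sub.flag > 0, 'flag'].cumsum()
    sub['flag'] = sub['flag'].ffill()
    # assumes df.index is unnamed, so reset_index() yields an 'index' column
    x = sub[sub.flag > 0].reset_index().groupby('flag').agg(
        {'index': ['min', 'max'], 'amount': 'sum'})
    x.columns = ['start', 'end', 'amount']
    x['period_range'] = (x.end - x.start).dt.days + 1
    x['cell'] = cell
    # reindex returns a new frame, so the result has to be assigned
    x = x.reindex(columns=['start', 'end', 'period_range', 'amount', 'cell'])
    periods.append(x)

resul = pd.concat(periods).reset_index(drop=True)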
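To make the labelling step easier to follow, here is a minimal sketch of what `pd.cut`, `cumsum` and `ffill` do to a single column. The dates and values are made up for illustration and are not taken from the question's data; the thresholds 0.11 and 15 are the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values for one cell (illustration only).
idx = pd.date_range("2019-07-03", periods=8, freq="D")
amount = pd.Series([16.0, 0.0, 0.05, 20.0, 0.0, 3.0, 0.0, 0.0], index=idx)

# NaN for dry days (<= 0.11), 0 for moderate rain, 1 for an event (> 15).
flag = pd.cut(amount, [0.11, 15, np.inf], labels=[0, 1]).astype(float)

# Number the events 1, 2, 3, ... so that each one gets its own group id.
is_event = flag > 0
flag[is_event] = flag[is_event].cumsum()

# Dry days inherit the id of the preceding event. Moderate rain stays 0,
# and the dry days after it inherit that 0, which closes the period.
flag = flag.ffill()

# Rows per event group; equals the period length only for gap-free daily data.
days = flag[flag > 0].value_counts().sort_index()

print(flag.tolist())  # [1.0, 1.0, 1.0, 2.0, 2.0, 0.0, 0.0, 0.0]
print(days.tolist())  # [3, 2]
```

In this toy column there are two events, and each one ends up in its own group, so several events in the same column are kept apart.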

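Since I do not have the full dataset, I cannot say what takes up the time, but I guess it is the index accesses when fetching the periods and the sort operation performed inside the loop. The following should be logically equivalent to your code, apart from a few changes: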

duration = 0  # days with no rain or less than pp_max_1
count = False  # True while a dry period is being counted

index_list = df.index  # index for updating df
period_range = 0  # number of days after the event without much rain (integer)
period_amount = 0  # precipitation during the dry days, excluding the event (integer)
event_amount = 0.0  # heavy rainfall on the event date (float)
pp = 0  # actual precipitation
pp_sum = 0.0  # mm
pp_min = 15.0  # mm, min precipitation to start counting dry days until duration_min_after
pp_max_1 = 0.11  # mm, max precipitation for one day while counting dry days
dry_days = 0  # dry days after the event
dry_periods = []

for counter_columns, column in enumerate(df.columns, 1):
    for period, y in df[column].items():
        if not count and y >= pp_min:
            duration += 1
            count = True
            start_period = period
            event_amount = y
            pp_sum += y
        elif count and (y >= pp_min or y >= pp_max_1):
            end_period = period
            dry_periods.append({
                    "start_period":  start_period ,
                    "end_period":    end_period,
                    "period_range":  duration,
                    "period_amount": pp_sum ,
                    "event_amount":  event_amount, 
                    "cell":          column})
            duration = 0
            count =    False
            pp_sum =   0
        elif count and y <= pp_max_1:  # dry day: keep counting
            duration += 1
            pp_sum   += y
    print("column :",counter_columns, "finished")

dry_periods.sort(key=lambda record: record['period_range'])
print(dry_periods)
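If the sorted records are needed for further analysis rather than just printed, one option is to turn the list of dicts into a DataFrame and sort there. This is only a sketch: the two records below are hypothetical and merely mimic the keys used in `dry_periods`.

```python
import pandas as pd

# Hypothetical records with the same keys as the dicts appended above.
dry_periods = [
    {"start_period": pd.Timestamp("2019-07-05"),
     "end_period": pd.Timestamp("2019-07-08"),
     "period_range": 3, "period_amount": 16.0,
     "event_amount": 16.0, "cell": 486335},
    {"start_period": pd.Timestamp("2019-07-03"),
     "end_period": pd.Timestamp("2019-07-04"),
     "period_range": 1, "period_amount": 22.0,
     "event_amount": 22.0, "cell": 486337},
]

# One row per dry period, ordered by its length in days.
result = pd.DataFrame(dry_periods).sort_values("period_range", ignore_index=True)
print(result[["cell", "period_range"]])
```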

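The condition "elif count and (y >= pp_min or y >= pp_max_1):" above looks suspicious to me, but it is just the rewritten condition from your program. If you can, maybe remove one of the two comparisons, because I guess pp_min…

@Hanggy asks: what is in the columns? (Cannot comment because of reputation.)

Overall it does not look that bad. I can only imagine that the index accesses may cost time. How does the performance change if you replace "for y in df[x]" with "for period, y in df[x].items():" and then set "start_period = period" (and likewise end_period) everywhere you currently do an index access? I would expect it to perform better. That way you can also get rid of your "iteration" variable and the technical code that goes with it.

Ah, one more thing: I think you can also get rid of "if iteration == counter:". I would rather move the code executed there into the outer loop (after the inner loop). That probably will not save much runtime, but it makes the code easier to understand and maintain, because you no longer need to know how many iterations will run before entering the loop (I think you can also drop the "counter" variable this way).

Loops in Python are the least efficient solution; always try to build a vectorized solution for DataFrames. In your DataFrame, with rain > 15 and rain…, a fully vectorized solution is not possible, but Serge Ballesta has given you a good approach.

Nice! Do you really need the ffill above? What if you skip the loc[sub.flag>0 and add the zeros as well? The period length is the length from the start of the first period to the end of the last one, right?

@jottbe: The point is that any value between 0.11 and 15 interrupts the current dry period without starting a new group. The period length is the number of days between the first day of the event and the last day of that event, plus 1.

Nice solution. I had not come across pd.cut before; I am sure it will make my life easier. But how do you handle the case where several events occur in the same column? Or does it already do that?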
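Two of the points raised in the comments can be checked with a short sketch. It assumes the thresholds from the code above (pp_min = 15.0, pp_max_1 = 0.11); the helper names and sample values are hypothetical. Because pp_min is larger than pp_max_1 here, the first comparison in the combined condition never changes the outcome. The second part shows that `Series.items()` yields the index label together with the value, so no separate index lookup is needed.

```python
import pandas as pd

pp_min = 15.0
pp_max_1 = 0.11

def ends_period_original(y):
    # Condition as written in the loop.
    return y >= pp_min or y >= pp_max_1

def ends_period_simplified(y):
    # Equivalent whenever pp_min >= pp_max_1.
    return y >= pp_max_1

samples = [0.0, 0.05, 0.11, 0.5, 14.9, 15.0, 22.0]
same = all(ends_period_original(y) == ends_period_simplified(y) for y in samples)
print(same)

# items() yields (index label, value) pairs, so the date comes for free.
s = pd.Series([16.0, 0.0, 3.0],
              index=pd.date_range("2019-07-03", periods=3, freq="D"))
wet_days = [period for period, y in s.items() if y >= pp_max_1]
print(wet_days)
```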