Python pandas: counting zeros in a time series

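I have a daily time series [1980 to present], and for each daily time step I need to check for zeros and systematically remove records. I ultimately want to vectorize this solution so that these operations can be done as preprocessing before I continue with my analysis. If I have the DataFrame df: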

         date               name  elev_exact      swe
0  1990-10-30   COTTONWOOD_CREEK    2337.816  0.01524
1  1990-10-30    EMIGRANT_SUMMIT    2252.472  0.00000
2  1990-10-30     PHILLIPS_BENCH    2499.360  0.05334
3  1990-10-30    PINE_CREEK_PASS    2048.256  0.00000
4  1990-10-30  SALT_RIVER_SUMMIT    2328.672  0.00000
5  1990-10-30      SEDGWICK_PEAK    2392.680  0.00000
6  1990-10-30          SHEEP_MTN    2026.920  0.00000
7  1990-10-30  SLUG_CREEK_DIVIDE    2202.180  0.00000
8  1990-10-30       SOMSEN_RANCH    2072.640  0.00000
9  1990-10-30   WILDHORSE_DIVIDE    1978.152  0.00000
10 1990-10-30       WILLOW_CREEK    2462.784  0.01778
11 1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910
12 1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418
13 1991-03-15     PHILLIPS_BENCH    2499.360  0.52832
14 1991-03-15    PINE_CREEK_PASS    2048.256  0.32258
15 1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876
16 1991-03-15      SEDGWICK_PEAK    2392.680  0.39878
17 1991-03-15          SHEEP_MTN    2026.920  0.31242
18 1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464
19 1991-03-15       SOMSEN_RANCH    2072.640  0.29972
20 1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052
21 1991-03-15       WILLOW_CREEK    2462.784  0.60706
22 1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270
23 1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016
24 1991-10-25     PHILLIPS_BENCH    2499.360  0.02286
25 1991-10-25    PINE_CREEK_PASS    2048.256  0.00508
26 1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016
27 1991-10-25      SEDGWICK_PEAK    2392.680  0.00254
28 1991-10-25          SHEEP_MTN    2026.920  0.00000
29 1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762
30 1991-10-25       SOMSEN_RANCH    2072.640  0.00000
31 1991-10-25   WILDHORSE_DIVIDE    1978.152  0.00508
32 1991-10-25       WILLOW_CREEK    2462.784  0.02032
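The problem is that I want to find the days with more than one zero swe measurement and keep only the zero observation with the maximum elev_exact value. Then I need to merge the desired zero record back into df.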

Here is a groupby loop that does what I want:

import pandas as pd

pieces = []
for date, group in df.groupby('date'):
    non_zero = group[group.swe > 0]

    if not group.equals(non_zero):
        # At least one zero on this date: keep only the highest-elevation zero
        zeros = group[group.swe == 0]
        zero_kept = zeros.loc[[zeros.elev_exact.idxmax()]]
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        out = pd.concat([non_zero, zero_kept])
        # Drop every station below the kept zero's elevation
        out = out[out.elev_exact >= zero_kept.elev_exact.iloc[0]]
        pieces.append(out)
    else:
        pieces.append(non_zero)

result = pd.concat(pieces)
I don't mind using groupby, but I would like to use it more methodically so that I don't need the inner if-else branch.

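Here is my take on the problem:

  • For each daily time step, find where there is more than one zero measurement (zero_count > 1)
  • For each day with multiple zeros, find the maximum elevation
  • Apply the elevation threshold found back to df

This also works, but the lambda function in step 1 is very slow. Is there another way to count the zeros?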
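One way to count zeros per day without a lambda is to build a boolean mask once and sum it per date, since True counts as 1 when summed. The following is a minimal sketch, assuming df has the date and swe columns shown above; the toy frame, the station names A, B, C, and the names is_zero, zero_count and multi_zero_dates are illustrative only.

```python
import pandas as pd

# Small illustrative frame with the same columns as df above
df = pd.DataFrame({
    "date": pd.to_datetime(["1990-10-30"] * 3 + ["1991-03-15"] * 2),
    "name": ["A", "B", "C", "A", "B"],
    "elev_exact": [2337.816, 2252.472, 2048.256, 2337.816, 2252.472],
    "swe": [0.01524, 0.0, 0.0, 0.41910, 0.42418],
})

# Boolean mask of zero measurements; True counts as 1 when summed
is_zero = df["swe"].eq(0)

# Zeros per date, computed without a Python-level lambda
zero_count = is_zero.groupby(df["date"]).sum()

# Dates with more than one zero measurement
multi_zero_dates = zero_count[zero_count > 1].index
```

Because the comparison and the sum both run as vectorized operations, this should scale better than applying a lambda to every group, although the actual speedup depends on the data.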

Expected output:

              date               name  elev_exact      swe
    2   1990-10-30     PHILLIPS_BENCH    2499.360  0.05334
    5   1990-10-30      SEDGWICK_PEAK    2392.680  0.00000
    10  1990-10-30       WILLOW_CREEK    2462.784  0.01778
    11  1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910
    12  1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418
    13  1991-03-15     PHILLIPS_BENCH    2499.360  0.52832
    14  1991-03-15    PINE_CREEK_PASS    2048.256  0.32258
    15  1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876
    16  1991-03-15      SEDGWICK_PEAK    2392.680  0.39878
    17  1991-03-15          SHEEP_MTN    2026.920  0.31242
    18  1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464
    19  1991-03-15       SOMSEN_RANCH    2072.640  0.29972
    20  1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052
    21  1991-03-15       WILLOW_CREEK    2462.784  0.60706
    22  1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270
    23  1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016
    24  1991-10-25     PHILLIPS_BENCH    2499.360  0.02286
    26  1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016
    27  1991-10-25      SEDGWICK_PEAK    2392.680  0.00254
    29  1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762
    30  1991-10-25       SOMSEN_RANCH    2072.640  0.00000
    32  1991-10-25       WILLOW_CREEK    2462.784  0.02032
    

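You can try this: split the DataFrame into non-zero and zero rows, sort the zero rows by elev_exact in descending order, and use drop_duplicates with subset on the date column so that only the highest-elevation zero per date remains. Finally, use pd.concat to join the DataFrames back together and sort by index: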

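    # Rows with a non-zero swe are kept unchanged
    df_nonzeroes = df[df['swe'].ne(0)]
    # Among the zero rows, keep the highest-elevation one per date
    df_zeroes = df[df['swe'].eq(0)].sort_values('elev_exact', ascending=False).drop_duplicates(subset=['date'])

    # Recombine and restore the original row order
    df_out = pd.concat([df_nonzeroes, df_zeroes]).sort_index()
    print(df_out)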
    
Output:

              date               name  elev_exact      swe
    0   1990-10-30   COTTONWOOD_CREEK    2337.816  0.01524
    2   1990-10-30     PHILLIPS_BENCH    2499.360  0.05334
    5   1990-10-30      SEDGWICK_PEAK    2392.680  0.00000
    10  1990-10-30       WILLOW_CREEK    2462.784  0.01778
    11  1991-03-15   COTTONWOOD_CREEK    2337.816  0.41910
    12  1991-03-15    EMIGRANT_SUMMIT    2252.472  0.42418
    13  1991-03-15     PHILLIPS_BENCH    2499.360  0.52832
    14  1991-03-15    PINE_CREEK_PASS    2048.256  0.32258
    15  1991-03-15  SALT_RIVER_SUMMIT    2328.672  0.23876
    16  1991-03-15      SEDGWICK_PEAK    2392.680  0.39878
    17  1991-03-15          SHEEP_MTN    2026.920  0.31242
    18  1991-03-15  SLUG_CREEK_DIVIDE    2202.180  0.29464
    19  1991-03-15       SOMSEN_RANCH    2072.640  0.29972
    20  1991-03-15   WILDHORSE_DIVIDE    1978.152  0.35052
    21  1991-03-15       WILLOW_CREEK    2462.784  0.60706
    22  1991-10-25   COTTONWOOD_CREEK    2337.816  0.01270
    23  1991-10-25    EMIGRANT_SUMMIT    2252.472  0.01016
    24  1991-10-25     PHILLIPS_BENCH    2499.360  0.02286
    25  1991-10-25    PINE_CREEK_PASS    2048.256  0.00508
    26  1991-10-25  SALT_RIVER_SUMMIT    2328.672  0.01016
    27  1991-10-25      SEDGWICK_PEAK    2392.680  0.00254
    29  1991-10-25  SLUG_CREEK_DIVIDE    2202.180  0.00762
    30  1991-10-25       SOMSEN_RANCH    2072.640  0.00000
    31  1991-10-25   WILDHORSE_DIVIDE    1978.152  0.00508
    32  1991-10-25       WILLOW_CREEK    2462.784  0.02032
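
Note that this output still contains rows 0, 25 and 31, which the expected output in the question omits because those stations lie below the elevation of the kept zero. If the threshold should also apply to the non-zero rows, one possible variant is to compute each date's maximum zero-swe elevation with groupby(...).transform and filter on it. This is a minimal sketch under that assumption, using a few rows from df; the names zero_elev and threshold are illustrative.

```python
import pandas as pd

# A few rows from df above
df = pd.DataFrame({
    "date": pd.to_datetime(["1990-10-30"] * 4 + ["1991-03-15"] * 2),
    "name": ["COTTONWOOD_CREEK", "EMIGRANT_SUMMIT", "PHILLIPS_BENCH",
             "SEDGWICK_PEAK", "COTTONWOOD_CREEK", "EMIGRANT_SUMMIT"],
    "elev_exact": [2337.816, 2252.472, 2499.360, 2392.680, 2337.816, 2252.472],
    "swe": [0.01524, 0.0, 0.05334, 0.0, 0.41910, 0.42418],
})

# Elevation where swe is zero, NaN elsewhere
zero_elev = df["elev_exact"].where(df["swe"].eq(0))

# Per-date maximum zero-swe elevation, broadcast back to every row;
# dates without any zero give NaN, replaced by -inf so all rows pass
threshold = zero_elev.groupby(df["date"]).transform("max").fillna(float("-inf"))

# Keep rows at or above the threshold, which also drops the lower zeros
df_out = df[df["elev_exact"] >= threshold]
print(df_out)
```

Like the loop in the question, this applies the threshold on any date with at least one zero. If two zero rows tie for the highest elevation on a date, both are kept.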
    

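Comments:

What is your expected output?
@QuangHoang Updated.
I'm curious why row 5 comes after row 10?
I posted the expected result of result from my initial for-loop attempt. Row 5 came after row 10 because I used non_zero.append(zero_kept). I have reposted the expected result after .sort_index().
This works great. Thank you very much.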
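
Step 3 of the approach in the question, applying the per-date threshold back to df (lu_dict maps each date to its elevation threshold):

    import numpy as np
    # Look up each row's threshold by date; unmatched dates give NaN
    df.loc[:, 'threshold'] = df.date.map(lu_dict)
    # Treat a missing threshold as 0 so those rows are all kept
    df = df.replace(np.nan, 0)
    df = df[df.elev_exact >= df.threshold].drop('threshold', axis=1)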
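
The construction of lu_dict is not shown here. A minimal sketch of how such a lookup could be built, assuming it maps each date with more than one zero swe measurement to the highest elevation among those zeros; the construction, the toy frame, and the names zeros, zero_count, max_zero_elev and filtered are assumptions, not the asker's code.

```python
import pandas as pd

# A few rows from df above
df = pd.DataFrame({
    "date": pd.to_datetime(["1990-10-30"] * 4 + ["1991-03-15"] * 2),
    "name": ["COTTONWOOD_CREEK", "EMIGRANT_SUMMIT", "PHILLIPS_BENCH",
             "SEDGWICK_PEAK", "COTTONWOOD_CREEK", "EMIGRANT_SUMMIT"],
    "elev_exact": [2337.816, 2252.472, 2499.360, 2392.680, 2337.816, 2252.472],
    "swe": [0.01524, 0.0, 0.05334, 0.0, 0.41910, 0.42418],
})

# Rows with a zero measurement
zeros = df[df["swe"].eq(0)]

# Number of zero rows per date
zero_count = zeros.groupby("date").size()

# Highest zero-swe elevation per date, kept only where zero_count > 1
max_zero_elev = zeros.groupby("date")["elev_exact"].max()
lu_dict = max_zero_elev[zero_count > 1].to_dict()

# Map thresholds onto rows; dates absent from lu_dict get NaN, then 0
threshold = df["date"].map(lu_dict).fillna(0)
filtered = df[df["elev_exact"] >= threshold]
```

Unlike the loop in the question, this version leaves dates with exactly one zero untouched, which matches the zero_count > 1 condition in the list of steps.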
    