Python: fuzzy join on multiple conditions (Python, pandas)


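I know there are a few questions with similar titles, but none of them really answers mine. I have a data frame like the one below. The "index" column is actually a timestamp. Column A is how many tonnes of material were dumped into the crusher. Column B is the crushing rate at each timestamp. What I want to know is when each batch of material (column A) will have been crushed, given the crushing rate (column B).

There are three possible scenarios:

The first load is crushed at the moment the second load is loaded.

The first load is crushed before the second load is loaded.

The first load has not yet been crushed when the second load is added.

I tried computing the cumulative values of columns A and B and doing a fuzzy join with merge_asof. It did not work as expected, because excess crushing capacity is not stored: only the crushing rate after the material has been loaded should be counted.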

import pandas as pd

A = {'index': range(1, 11), 'A': [300, 0, 0, 400, 0, 0, 0, 0, 150, 0]}
B = {'index': range(1, 11), 'B': [102, 103, 94, 120, 145, 114, 126, 117, 107, 100]}
A = pd.DataFrame(data=A)
B = pd.DataFrame(data=B)
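For illustration, the merge_asof attempt described above might have looked roughly like the sketch below. This is a reconstruction, not the asker's actual code, and the names `loads`, `work`, `A_cum` and `B_cum` are made up here. It matches each cumulative load against the cumulative crushing total, which works for the first load but finishes later loads too early, because capacity accumulated before a load arrives is counted toward it.

```python
import pandas as pd

# Sample data from the question (1-based "index" stands in for the timestamp).
A = pd.DataFrame({"index": range(1, 11),
                  "A": [300, 0, 0, 400, 0, 0, 0, 0, 150, 0]})
B = pd.DataFrame({"index": range(1, 11),
                  "B": [102, 103, 94, 120, 145, 114, 126, 117, 107, 100]})

# Keep only the rows where material was loaded and accumulate the tonnage.
loads = (A[A["A"] > 0]
         .rename(columns={"index": "IndexA"})
         .assign(A_cum=lambda d: d["A"].cumsum()))

# Accumulate the crushing rate over all timestamps.
work = (B.rename(columns={"index": "IndexB"})
         .assign(B_cum=lambda d: d["B"].cumsum()))

# For each cumulative load, take the first timestamp whose cumulative
# crushing total is at least that large.
naive = pd.merge_asof(loads, work,
                      left_on="A_cum", right_on="B_cum",
                      direction="forward")
print(naive[["IndexA", "A", "IndexB"]])
```

With this data the join reports IndexB = 4, 7, 8, whereas the expected values are 4, 8, 10.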

This is the expected result:

IndexA  A   IndexB  B_accumulate 
1      300  4       419
4      400  8       502
9      150  10      207

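B_accumulate is the running sum of the crushing rate (B); it is reset to 0 whenever a batch of material has been crushed (when B_accumulate >= A).

I'm sure this can be simplified: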

import pandas as pd

C = A.join(B.set_index('index'), on='index')

# Carry each non-zero load forward over the zero rows that follow it.
C['A_filled'] = C['A'].where(C['A'] > 0).ffill()
# The cumulative load only changes when a new load arrives, so it identifies the load.
C['cumul_load'] = C['A'].cumsum()
C['load_number'] = C.groupby('cumul_load').ngroup() + 1
# Running crushing total, restarted at every new load.
C['B_accum'] = C.groupby('load_number')['B'].cumsum()
C['A_fully_crushed'] = C['B_accum'] >= C['A_filled']
C['first_index_fully_crushed'] = C.groupby('load_number')['A_fully_crushed'].cumsum() == 1

done = C['first_index_fully_crushed']
indexA_ = C['index'][C['A'] > 0].tolist()
A_ = C['A'][C['A'] > 0].tolist()
indexB_ = C['index'][done].tolist()
B_accumulate_ = C['B_accum'][done].tolist()
result = pd.DataFrame({'indexA': indexA_, 'A': A_, 'indexB': indexB_, 'B_accumulate': B_accumulate_})

This produces (for sample data in which the 400 t load arrives at index 6):

   indexA    A  indexB  B_accumulate
0       1  300       4           419
1       6  400       9           464

Create a DataFrame combining A and B:

import numpy as np
import pandas as pd

A = {'index': range(1, 11), 'A': [300, 0, 400, 0, 0, 0, 0, 0, 100, 0]}
B = {'index': range(1, 11), 'B': [102, 103, 94, 120, 145, 114, 126, 117, 107, 87]}
df_A = pd.DataFrame(data=A)
df_B = pd.DataFrame(data=B)
df_com = pd.concat([df_A, df_B], axis=1).drop('index', axis=1)

Create the indices:

indexA = list(df_com.A[df_com.A.ne(0)].index + 1)
indexB = np.array(indexA) - 2
indexB = np.append(indexB[1:],(len(df_com)-1))

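Replace the zeros in column A by forward-filling the previous load:

# Series.replace(..., method='pad') is deprecated in recent pandas; mask the zeros and ffill instead.
# The cast back to int assumes the first row holds a non-zero load, as it does here.
df_com['A'] = df_com.A.where(df_com.A.ne(0)).ffill().astype(int)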

Group by A and add the index columns:

df_new =df_com.groupby("A",sort=False).apply(lambda x:x.B.shift(1).sum()).reset_index()
df_new['indexA'] = indexA
df_new['indexB'] = indexB
df_new

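A possible approach. The problem splits into two parts: getting the actual amount of material in the crusher (which cannot be negative) and analysing the loads (grouping the rows in which there is any material left to crush in the current time step).

I simplified the structure: I use Series instead of DataFrames, and the indices start from zero. cumsum() and searchsorted() are applied.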

import pandas as pd

Load = pd.Series([300, 0, 0, 400, 50, 0, 0, 0, 150, 0])  # aka 'A'
Rate = pd.Series([102, 103, 94, 120, 145, 114, 126, 117, 107, 100])  # aka 'B'

# Storage for the result:
H = []  # [(indexLoad, Load, indexRate, excess), ...]

# Find the first non-zero load:
load1_idx = len(Load)

for lix in range(len(Load)):
    a = Load[lix]
    if a != 0:
        csumser = Rate.cumsum()
        rix = csumser.searchsorted(a)  # first step where the running total reaches a
        excess = csumser[rix] - a
        H.append((lix, a, rix, excess))
        load1_idx = lix
        break

# Process the remaining loads:
for lix in range(load1_idx + 1, len(Load)):
    a = Load[lix]
    if a == 0:
        continue

    # Restart the running total at the step where the previous load finished.
    last_rix = H[-1][-2]
    csumser[last_rix:] = Rate[last_rix:]
    if lix == last_rix:
        csumser[lix] = H[-1][-1]  # only the excess is left for this step

    csumser[last_rix:] = csumser[last_rix:].cumsum()

    rix = csumser[last_rix:].searchsorted(a)
    rix += last_rix
    excess = csumser[rix] - a
    H.append((lix, a, rix, excess))

df = pd.DataFrame(H, columns=["indexLoad", "Load", "indexRate", "rate_excess"])
print(df)

   indexLoad  Load  indexRate  rate_excess
0          0   300          3          119
1          3   400          6          104
2          4    50          6           76
3          8   150          7           93

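Hi @WolfgangK, thanks for your solution. However, it does not work in all cases. I should also have mentioned that there are cases where the first load has not been fully crushed when the second load is added. For example, if I change data frame A to A = {'index': range(1, 11), 'A': [300, 0, 0, 400, 0, 0, 0, 0, 0, 0]}, your solution does not work.

@Cypress Thanks for looking at my post. Does my solution not yet cover the other cases? If so, consider adding those cases to your question.

I think there are three possible scenarios. 1) The first load is crushed at the moment the second load is loaded. 2) The first load is crushed before the second load. 3) The first load has not been crushed when the second load is added. I will update the question.

Hi Ryan, thanks for your post. However, your solution only works for the sample data provided. The real data is much larger than the sample, and your approach does not work on it.

At index 4, the first load of 300 t has been crushed. At that point there is still crushing capacity left over, and the next load has already been loaded. But crushing of the next load only starts at the next time step. Is that correct? Can we read this as "consecutive loads do not mix"?

Hi Poolka, thanks for your reply. However, I tried it on my original dataset and your solution still does not seem to handle the third scenario. In that scenario your code combines the first and second loads, which is not what I want.

Thanks for your post, kantal. Your code almost answers my question. I only needed to change your last_rix calculation to last_rix = H[-1][2] + 1. Thanks a lot for your help!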
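The rule the comments converge on (each load starts crushing one step after the previous load finished, and leftover capacity is not carried over) can be sketched as below. This is a minimal sketch under my own assumptions, not code from any of the answers: the function name `crush_schedule` is hypothetical, loads are assumed positive and crushed one at a time in arrival order, rates are assumed non-negative, and a load never starts crushing before its own arrival step.

```python
import numpy as np
import pandas as pd


def crush_schedule(load, rate):
    """Return one row per non-zero load with 1-based arrival and finish steps.

    Assumes loads are crushed in arrival order, one at a time, and that
    capacity left over when a load finishes is discarded.
    """
    load = np.asarray(load)
    rate = np.asarray(rate)
    # csum[k] is the total capacity of steps 0 .. k-1 (0-based).
    csum = np.concatenate(([0], rate.cumsum()))
    rows = []
    next_free = 0  # first 0-based step at which the crusher is available
    for i in np.flatnonzero(load):
        start = max(int(i), next_free)
        # Find the first step j >= start with rate[start..j].sum() >= load[i].
        target = csum[start] + load[i]
        j = int(np.searchsorted(csum, target, side="left")) - 1
        if j >= len(rate):
            break  # not enough capacity left to finish this load
        crushed = int(csum[j + 1] - csum[start])
        rows.append((int(i) + 1, int(load[i]), j + 1, crushed))
        next_free = j + 1
    return pd.DataFrame(rows, columns=["IndexA", "A", "IndexB", "B_accumulate"])


Load = [300, 0, 0, 400, 0, 0, 0, 0, 150, 0]
Rate = [102, 103, 94, 120, 145, 114, 126, 117, 107, 100]
print(crush_schedule(Load, Rate))
```

On the sample data from the question this gives the rows (1, 300, 4, 419), (4, 400, 8, 502) and (9, 150, 10, 207), which matches the expected result. A load that cannot be finished with the remaining capacity is left out of the output, along with any loads after it.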

   indexA  total_load  loads_qty  indexB  total_work
0       1         300          1       4         419
1       6         500          2      10         551
2      12         300          1      15         419
3      17         500          2      21         551
