Python: calculate the difference between two values in one column while staying within the bounds of another column?

python pandas dataframe nlp

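I have a DataFrame and am trying to calculate the time difference between two different topics while staying within a single call, without spilling over into a new call (i.e., making sure it does not compute the time difference between topics that belong to different calls). Each interaction_id is a separate call.

Here is a sample DataFrame: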

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, np.nan], [1, 8.83, 'Billing'], [1, 12.86, np.nan], [2, 2, 'Cost'], [2, 6.75, np.nan], [2, 8.54, np.nan], [3, 1.5, 'Payments'], [3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])

      interaction_id    start_time     topic 
           1               2           Cost
           1              5.72          NaN
           1              8.83         Billing
           1              12.86         NaN
           2               2            Cost
           2              6.75          NaN
           2              8.54          NaN
           3              1.5          Payments
           3              3.65         Products

This is the desired output:

# Requires: import numpy as np; import pandas as pd
df2 = pd.DataFrame([[1, 2, 'Cost', 6.83], [1, 5.72, np.nan, np.nan], [1, 8.83, 'Billing', 4.03], [1, 12.86, np.nan, np.nan], [2, 2, 'Cost', 6.54], [2, 6.75, np.nan, np.nan], [2, 8.54, np.nan, np.nan], [3, 1.5, 'Payments', 2.15], [3, 3.65, 'Products', '...']], columns=['interaction_id', 'start_time', 'topic', 'topic_length'])

       interaction_id    start_time     topic     topic_length

           1               2           Cost           6.83
           1              5.72          NaN           NaN
           1              8.83         Billing        4.03
           1              12.86         NaN           NaN
           2               2            Cost          6.54
           2              6.75          NaN           NaN
           2              8.54          NaN           NaN
           3              1.5          Payments       2.15
           3              3.65         Products       ....

I don't know whether there is a simpler solution, but this approach solves your problem:

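def custom_agg(group):
    # Work on a 0..n-1 index so positions within the call are easy to compare.
    group = group.reset_index(drop=True)
    max_ind = group.shape[0]-1
    current_ind = -1      # row index of the topic currently being timed
    current_val = None    # start_time of that topic
    for ind, val in group.iterrows():
        # Skip rows without a topic, except the last row of the call,
        # which always closes the topic that is still open.
        if pd.isna(val.topic) and ind != max_ind:
            continue
        if current_ind == -1:
            # First topic of the call: nothing to close yet.
            current_ind = ind
            current_val = val["start_time"]
        else:
            # Close the previous topic and start timing the new one.
            group.loc[current_ind,"topic_length"] = val["start_time"] - current_val
            current_ind = ind
            current_val = val["start_time"]
    return group

# Sort by call and time, then process each call separately so that
# differences never cross an interaction_id boundary.
df = df.sort_values(by=['interaction_id', 'start_time']).groupby("interaction_id").apply(custom_agg).reset_index(drop=True)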

Output:

   interaction_id  start_time     topic  topic_length
0               1        2.00      Cost          6.83
1               1        5.72       NaN           NaN
2               1        8.83   Billing          4.03
3               1       12.86       NaN           NaN
4               2        2.00      Cost          6.54
5               2        6.75       NaN           NaN
6               2        8.54       NaN           NaN
7               3        1.50  Payments          2.15
8               3        3.65  Products           NaN
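
As a possible alternative that is not part of the answer above, the same result can probably be reached without a Python-level loop. The sketch below assumes the same rule the loop implements: a topic ends at the next row that has a topic, or at the last row of the call if no further topic follows. The names `is_last`, `boundary`, `b`, and `length` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "Cost"], [1, 5.72, np.nan], [1, 8.83, "Billing"], [1, 12.86, np.nan],
     [2, 2, "Cost"], [2, 6.75, np.nan], [2, 8.54, np.nan],
     [3, 1.5, "Payments"], [3, 3.65, "Products"]],
    columns=["interaction_id", "start_time", "topic"],
)
df = df.sort_values(["interaction_id", "start_time"]).reset_index(drop=True)

# Boundary rows: every row that has a topic, plus the last row of each call.
is_last = ~df.duplicated("interaction_id", keep="last")
boundary = df["topic"].notna() | is_last

# Looking only at boundary rows, take the next boundary's start_time
# within the same call and subtract the current start_time.
b = df.loc[boundary]
length = b.groupby("interaction_id")["start_time"].shift(-1) - b["start_time"]

# Assignment aligns on the index, so non-boundary rows are left as NaN.
df["topic_length"] = length
print(df)
```

Because `shift(-1)` is applied per `interaction_id`, the last boundary row of each call has no successor and stays NaN, so a difference never crosses into the next call.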

This is exactly what I needed! Thanks a lot, Hussein :)