Calculate a cumulative count by checking whether two lists have a common value [string, pandas, match, cumulative-sum]


Suppose I have a table like this:

|---------------------|------------------|
|      time           | list of string   |
|---------------------|------------------|
| 2019-06-18 09:05:00 |   ['A', 'B', 'C']|
|---------------------|------------------|
| 2019-06-19 09:05:00 |   ['A', 'C']     |
|---------------------|------------------|
| 2019-06-19 09:05:00 |   ['B', 'C']     |
|---------------------|------------------|
| 2019-06-20 09:05:00 |   ['C']          |
|---------------------|------------------|
| 2019-06-20 09:05:00 |   ['A', 'B', 'C']|
|---------------------|------------------|

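For each row, I want to know how many rows with a timestamp before the current one share at least one value with the current row's list of strings.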

My slow code looks like this:

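results = []
for i in range(len(df)):
    current_t = df['time'].iloc[i]
    current_string = df['list_of_string'].iloc[i]
    # Keep only the rows whose time is strictly earlier than the current row's
    df_before_t = df[df['time'] < current_t]
    cumm_count = 0
    for row in df_before_t['list_of_string']:
        # Count the row if it shares at least one value with the current list
        if set(current_string) & set(row):
            cumm_count += 1
    results.append(cumm_count)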

My current dataset is relatively large, so I would appreciate help processing it quickly. Thank you very much.

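One approach is to convert the lists to sets and use a list comprehension over 'list of string', comparing each row against the rows whose 'time' is earlier than the current row's 'time':
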
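import pandas as pd

s = df['list of string'].map(set)  # each list as a set
t = pd.to_datetime(df.time)        # timestamps as datetimes

# For row i, count the strictly earlier rows whose set intersects x
df['result'] = [sum(len(x & y) != 0 for y in s[t.iloc[i] > t])
                for i, x in enumerate(s)]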

Out[283]:
                  time list of string  result
0  2019-06-18 09:05:00      [A, B, C]       0
1  2019-06-19 09:05:00         [A, C]       1
2  2019-06-19 09:05:00            [D]       0
3  2019-06-20 09:05:00            [C]       2
4  2019-06-20 09:05:00      [A, B, C]       2
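
To try this end to end, here is a self-contained sketch. The frame construction is a reconstruction from the printed output above (including the [D] row), not something given in the original post, so treat the sample data as an assumption.

```python
import pandas as pd

# Sample frame rebuilt from the printed output (assumed, not from the post)
df = pd.DataFrame({
    'time': ['2019-06-18 09:05:00', '2019-06-19 09:05:00',
             '2019-06-19 09:05:00', '2019-06-20 09:05:00',
             '2019-06-20 09:05:00'],
    'list of string': [['A', 'B', 'C'], ['A', 'C'], ['D'],
                       ['C'], ['A', 'B', 'C']],
})

s = df['list of string'].map(set)  # each list as a set
t = pd.to_datetime(df['time'])     # timestamps as datetimes

# Strict ">" means rows with the same timestamp never count each other
df['result'] = [sum(len(x & y) != 0 for y in s[t.iloc[i] > t])
                for i, x in enumerate(s)]

print(df)
```

Because the mask uses a strict comparison, rows that share a timestamp are excluded from each other's counts.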

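Thank you very much. However, if the same timestamp appears in multiple rows, I need the count of rows that occurred before the current row being compared. How should I modify the code?

But in this case, the row with [D] should be 1, because its time is the same as that of the row with [D].

For the row with [D], only the first row is eligible for comparison (2019-06-18 09:05:00 is before 2019-06-19 09:05:00). Since [A, B, C] and [D] have no common value, the result for the row with [D] is 0.

@M.Cong: Hah! I get it now. It is just a small change. I edited the answer; check my update.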
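
Since the question is about speed on a larger dataset, one possible alternative (not the method from the answer) is an inverted index: walk the timestamps in ascending order and remember, for every value, which earlier rows contained it. The function name `count_prior_overlaps` and its parameters are hypothetical, and the sketch assumes there are no missing timestamps. It may help when each value occurs in relatively few rows; I have not benchmarked it.

```python
from collections import defaultdict

import pandas as pd


def count_prior_overlaps(times, lists):
    """Count, per row, the strictly earlier rows sharing at least one value."""
    times = pd.to_datetime(pd.Series(list(times)))
    sets = [set(x) for x in lists]

    # Group row positions by timestamp so tied rows are handled together
    groups = defaultdict(list)
    for pos, ts in enumerate(times):
        groups[ts].append(pos)

    seen = {}                   # value -> positions of strictly earlier rows
    result = [0] * len(sets)
    for ts in sorted(groups):
        # Answer every row in this group using only earlier timestamps
        for pos in groups[ts]:
            matches = set()
            for value in sets[pos]:
                matches.update(seen.get(value, ()))
            result[pos] = len(matches)
        # Register the group afterwards, so ties never count each other
        for pos in groups[ts]:
            for value in sets[pos]:
                seen.setdefault(value, set()).add(pos)
    return result


counts = count_prior_overlaps(
    ['2019-06-18 09:05:00', '2019-06-19 09:05:00', '2019-06-19 09:05:00',
     '2019-06-20 09:05:00', '2019-06-20 09:05:00'],
    [['A', 'B', 'C'], ['A', 'C'], ['D'], ['C'], ['A', 'B', 'C']],
)
print(counts)
```

The result can be assigned back with `df['result'] = count_prior_overlaps(df['time'], df['list of string'])`, since the counts come back in the original row order.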