Python: matching a sequence pattern of similar strings across two DataFrames while keeping the index and order

python pandas scikit-learn

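I have two DataFrames, df and df1 (both are created in the code further down). I need to match one or more strings from df1 against df and get, as output, the single matching sequence of strings, in order, together with its index numbers from df.

Expected output: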

idx id_0 user string
4 009124 17 today
5 000029 13 is
6 548751 21 a
7 479903 19 bright
8 897054 08 sunny
9 336588 7 day

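I tried several approaches, pd.merge, pd.concat, pd.join and also isin, but I get the wrong index numbers.

For example, converting df1.string to a list and comparing it against df with apply and a lambda function: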

import pandas as pd

df = pd.DataFrame([
    [0, "008457", "02", "hello"],
    [1, "990037", "05", "I"],
    [2, "774426", "10", "am"],
    [3, "564389", "08", "sleeping"],
    [4, "009124", "17", "today"],
    [5, "000029", "13", "is"],
    [6, "548751", "21", "a"],
    [7, "479903", "19", "bright"],
    [8, "897054", "08", "sunny"],
    [9, "336588", "7", "day"],
    [10, "294260", "16", "today"],
    [11, "908751", "29", "is"],
    [12, "558902", "81", "rainy"],
    [13, "097856", "19", "with"],
    [14, "110044", "24", "cold"],
    [15, "775098", "16", "today"],
    [16, "665490", "02", "is"],
    [17, "887099", "07", "sunday"],
    [18, "389011", "18", "ahhh"],
    [19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)

df1 = pd.DataFrame([
[0, "today"],
[1, "is"],
[2, "a"],
[3, "bright"],
[4, "sunny"],
[5, "day"]
],
columns=["idx", "string"]
)

string_list = df1.string.tolist()
filt = df['string'].apply(lambda x: any([k in x for k in string_list]))
print(df[filt])

    idx    id_0 user  string
2     2  774426   10      am
4     4  009124   17   today
5     5  000029   13      is
6     6  548751   21       a
7     7  479903   19  bright
8     8  897054   08   sunny
9     9  336588    7     day
10   10  294260   16   today
11   11  908751   29      is
12   12  558902   81   rainy
15   15  775098   16   today
16   16  665490   02      is
17   17  887099   07  sunday
18   18  389011   18    ahhh
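
The extra rows in that output (am, rainy, sunday, ahhh) appear because `k in x` is a substring test on Python strings, so "a" matches any word containing the letter a. A minimal sketch of the difference, using a small hypothetical Series named `words` rather than the full df. Note that exact matching with `isin` still ignores order, so a repeated word such as a later "today" is selected as well:

```python
import pandas as pd

# Hypothetical sample, not the full df from the question.
words = pd.Series(["am", "today", "is", "rainy", "a", "today"])
string_list = ["today", "is", "a"]

# Substring test: "a" is contained in "am" and "rainy", so they match too.
substring_mask = words.apply(lambda x: any(k in x for k in string_list))

# Exact membership: only whole-string matches, but order is still ignored.
exact_mask = words.isin(string_list)

print(words[substring_mask].tolist())
print(words[exact_mask].tolist())
```

Neither mask knows that the words must appear consecutively and in the order given by df1, which is why a sequence-aware comparison is needed.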

One possible approach is the following:

df = pd.DataFrame([
    [0, "008457", "02", "hello"],
    [1, "990037", "05", "I"],
    [2, "774426", "10", "am"],
    [3, "564389", "08", "sleeping"],
    [4, "009124", "17", "today"],
    [5, "000029", "13", "is"],
    [6, "548751", "21", "a"],
    [7, "479903", "19", "bright"],
    [8, "897054", "08", "sunny"],
    [9, "336588", "7", "day"],
    [10, "294260", "16", "today"],
    [11, "908751", "29", "is"],
    [12, "558902", "81", "rainy"],
    [13, "097856", "19", "with"],
    [14, "110044", "24", "cold"],
    [15, "775098", "16", "today"],
    [16, "665490", "02", "is"],
    [17, "887099", "07", "sunday"],
    [18, "389011", "18", "ahhh"],
    [19, "675510", "11", "weekend"]
],
columns=["idx", "id_0", "user", "string"]
)
df = df.set_index('idx')

df1 = pd.DataFrame([
[0, "today"],
[1, "is"],
[2, "a"],
[3, "bright"],
[4, "sunny"],
[5, "day"]
],
columns=["idx", "string"]
)


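# Slide a window of len(df1) rows over df and compare it with df1.string.
matching_indices = []
for i in range(len(df)-len(df1)+1):
    # A window matches only if every string is equal, in the same order.
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i,i+len(df1)))

# Positional selection, so the result keeps df's own index (idx).
df.iloc[matching_indices]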

Output:

    id_0    user    string
idx         
4   009124  17  today
5   000029  13  is
6   548751  21  a
7   479903  19  bright
8   897054  08  sunny
9   336588  7   day

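The code above returns every matching subsequence with its correct index, not just the first occurrence.

If you want only the first match, break out of the loop as soon as a match is found: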

matching_indices = []
for i in range(len(df)-len(df1)+1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i,i+len(df1)))
        break

df.iloc[matching_indices]
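
The same window comparison can also be wrapped in a reusable function. The following is only a sketch under assumptions: `find_sequence` and its parameters are hypothetical names, not part of pandas, and the sample frame is an abbreviated copy of the question's data. Instead of slicing a window per start position, it compares one shifted slice of the column per pattern element:

```python
import numpy as np
import pandas as pd

def find_sequence(df, pattern, column="string"):
    """Return the rows of df where `pattern` occurs as consecutive values in `column`."""
    values = df[column].to_numpy(dtype=object)
    n, m = len(values), len(pattern)
    if m == 0 or m > n:
        return df.iloc[[]]
    # starts[i] ends up True when values[i:i+m] equals pattern element by element.
    starts = np.ones(n - m + 1, dtype=bool)
    for j, p in enumerate(pattern):
        starts &= values[j:j + n - m + 1] == p
    # Expand each start position to the m row positions it covers.
    positions = (np.flatnonzero(starts)[:, None] + np.arange(m)).ravel()
    # np.unique removes duplicates that overlapping matches would create.
    return df.iloc[np.unique(positions)]

# Abbreviated sample taken from the question's df, indexed by idx.
df = pd.DataFrame(
    {"id_0": ["009124", "000029", "548751", "294260", "908751"],
     "string": ["today", "is", "a", "today", "is"]},
    index=pd.Index([4, 5, 6, 10, 11], name="idx"),
)

result = find_sequence(df, ["today", "is", "a"])
print(result)
```

Because the selection is positional (`iloc`), the returned rows carry df's original idx values; with the full data from the question one would pass `df1.string.tolist()` as the pattern.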

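"I tried several approaches, pd.merge, pd.concat, pd.join and also isin, but I get the wrong index numbers." Please include those attempts in your question.

I see that some words in df are repeated. How do you know which index/user to keep? It seems you just kept the first one.

If you include a copy-pasteable, reproducible piece of code that creates your input DataFrames, it may be easier for others to answer your question.

@np8 df=pd.read_clipboard() works well on these examples.

@Dan I have to keep the index of df, not that of the small dataset df1.

This raises TypeError: 'in <string>' requires string as left operand, not float. But it also returns more than the matching sequence of items.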
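
The TypeError quoted in the comments is what Python raises when the left operand of `in` is not a string, for example a NaN (a float) produced by a missing value. That the asker's real data contains missing values is an assumption here, and `pattern` below is a hypothetical stand-in for df1.string. A minimal sketch of reproducing the error and cleaning the list before comparing:

```python
import numpy as np
import pandas as pd

# Hypothetical pattern column with a missing value (NaN is a float).
pattern = pd.Series(["today", np.nan, "is"], dtype=object)

# A float on the left of `in` with a string on the right raises TypeError.
raised = False
try:
    float("nan") in "today"
except TypeError as exc:
    raised = True
    print(exc)

# Drop missing values and make sure everything left is a string.
clean = pattern.dropna().astype(str).tolist()
print(clean)
```

Cleaning the list only avoids the exception; the substring test still returns more rows than the ordered sequence.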
