Python: search a string column for multiple substrings and return the substring's category

I have two DataFrames, as follows:

import pandas as pd

df1 = pd.DataFrame({"id":["01", "02", "03", "04", "05", "06"],
                    "string":["This is a cat",
                              "That is a dog",
                              "Those are birds",
                              "These are bats",
                              "I drink coffee",
                              "I bought tea"]})

df2 = pd.DataFrame({"category":[1, 1, 2, 2, 3, 3],
                    "keywords":["cat", "dog", "birds", "bats", "coffee", "tea"]})

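My DataFrames look like this.

df1:

id   string
01   This is a cat
02   That is a dog
03   Those are birds
04   These are bats
05   I drink coffee
06   I bought tea

df2:

category   keywords
1          cat
1          dog
2          birds
2          bats
3          coffee
3          tea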

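I want an output column on df1 that holds the category when at least one keyword from df2 is detected in the df1 string, and None otherwise. The expected output should look like this: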

id   string             category
01   This is a cat         1
02   That is a dog         1
03   Those are birds       2
04   These are bats        2
05   I drink coffee        3
06   I bought tea          3

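I could loop over the keywords one by one and scan each string in turn, but that will not be efficient enough as the data grows. Could you suggest any improvements? Thanks, everyone.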

# Modified your data a bit.
df1 = pd.DataFrame({"id":["01", "02", "03", "04", "05", "06", "07"],
                    "string":["This is a cat",
                              "That is a dog",
                              "Those are birds",
                              "These are bats",
                              "I drink coffee",
                              "I bought tea", 
                              "This won't match squat"]})

You can use a list comprehension containing next with a default argument:

df1['category'] = [
    next((c for c, k in df2.values if k in s), None) for s in df1['string']] 

df1
   id                  string  category
0  01           This is a cat       1.0
1  02           That is a dog       1.0
2  03         Those are birds       2.0
3  04          These are bats       2.0
4  05          I drink coffee       3.0
5  06            I bought tea       3.0
6  07  This won't match squat       NaN

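You can't avoid the quadratic, O(N²)-type complexity, since in the worst case every string is checked against every keyword. Still, this should perform reasonably well: next stops at the first matching keyword, so the inner loop only runs through all keywords when nothing matches.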
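To make the early exit concrete, here is a minimal pure-Python sketch (no pandas) that counts how many keyword checks are performed. The helper names `contains` and `first_category` and the `stats` counter are illustrative assumptions, not part of the answer.

```python
# Category/keyword pairs mirroring df2 from the question.
pairs = [(1, "cat"), (1, "dog"), (2, "birds"),
         (2, "bats"), (3, "coffee"), (3, "tea")]

# Illustrative counter of how many substring tests were performed.
stats = {"checks": 0}


def contains(keyword, text):
    """Substring test that also records that it was called."""
    stats["checks"] += 1
    return keyword in text


def first_category(text):
    """Return the category of the first matching keyword, or None."""
    return next((c for c, k in pairs if contains(k, text)), None)


# A match on the first keyword stops the generator immediately.
early = first_category("This is a cat")
checks_early = stats["checks"]

# No match at all is the worst case: every keyword is tested.
stats["checks"] = 0
missing = first_category("This won't match squat")
checks_missing = stats["checks"]

print(early, checks_early)
print(missing, checks_missing)
```

The default argument of `next` is what turns an exhausted generator into None instead of a StopIteration error.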

Note that this currently supports substring matching only (not regex-based matching, although that is possible with a small modification).
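One possible form of that modification, as a sketch only: treat each keyword as a regular expression, compile it once, and test with `search` instead of `in`. The sample patterns, the `compiled` list, and the case-insensitive flag are assumptions made for illustration.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03"],
                    "string": ["This is a cat",
                               "Two Dogs here",
                               "Nothing relevant"]})

# Assumption: the keywords column now holds regex patterns.
df2 = pd.DataFrame({"category": [1, 1],
                    "keywords": [r"\bcat\b", r"\bdogs?\b"]})

# Compile each pattern once instead of once per string.
compiled = [(c, re.compile(k, flags=re.IGNORECASE))
            for c, k in zip(df2["category"], df2["keywords"])]

# Same next-with-default idea, with a regex search as the test.
categories = [next((c for c, p in compiled if p.search(s)), None)
              for s in df1["string"]]
df1["category"] = categories

print(df1)
```

As in the substring version, the first pattern in df2 order wins when several patterns match the same string.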

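Use a list comprehension with split, matching against a dictionary built from df2. Because split compares whole words, the output below (which appears to have been produced with the last string changed to "I bought thea") gives NaN for that row: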
d = dict(zip(df2['keywords'], df2['category']))
df1['cat'] = [next((d[y] for y in x.split() if y in d), None) for x in df1['string']]

print(df1)
   id           string  cat
0  01    This is a cat  1.0
1  02    That is a dog  1.0
2  03  Those are birds  2.0
3  04   These are bats  2.0
4  05   I drink coffee  3.0
5  06    I bought thea  NaN
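The dictionary lookup on split words is not equivalent to a substring test, and the difference can matter on real text. A small pure-Python sketch under assumed sample strings; `by_substring` and `by_word` are hypothetical helper names:

```python
keyword_to_category = {"cat": 1, "tea": 3}


def by_substring(text):
    """Category of the first keyword found anywhere in the text."""
    return next((c for k, c in keyword_to_category.items() if k in text), None)


def by_word(text):
    """Category of the first whitespace-separated word that is a keyword."""
    return next((keyword_to_category[w] for w in text.split()
                 if w in keyword_to_category), None)


# "catalog" contains "cat" as a substring but is not the word "cat".
sub_catalog = by_substring("Read the catalog")
word_catalog = by_word("Read the catalog")

# Trailing punctuation stays attached to the word after split().
sub_punct = by_substring("I bought tea.")
word_punct = by_word("I bought tea.")

print(sub_catalog, word_catalog)
print(sub_punct, word_punct)
```

The word-based version does one dictionary lookup per word, so its cost does not grow with the number of keywords, but it misses keywords that are glued to punctuation unless the text is cleaned first.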

Another easy-to-understand solution maps a function over df1['string']:
# create a dictionary with keyword->category pairs
cats = dict(zip(df2.keywords, df2.category))

def categorize(s):
    # return the category of the first keyword found in s
    for keyword, category in cats.items():
        if keyword in s:
            return category
    # return 0 in case nothing is found
    return 0

df1['category'] = df1['string'].map(categorize)

print(df1)

   id           string  category
0  01    This is a cat         1
1  02    That is a dog         1
2  03  Those are birds         2
3  04   These are bats         2
4  05   I drink coffee         3
5  06     I bought tea         3
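For comparison, a vectorized sketch that is not taken from any of the answers: build one alternation pattern from the keywords, pull out the matched keyword with `Series.str.extract`, then map it to its category. The names `keyword_to_category`, `pattern`, and `matched` are illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06", "07"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea",
                               "This won't match squat"]})

df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})

keyword_to_category = dict(zip(df2["keywords"], df2["category"]))

# One capture group with every keyword escaped, e.g. "(cat|dog|...)".
pattern = "(" + "|".join(re.escape(k) for k in keyword_to_category) + ")"

# expand=False returns a Series holding the matched keyword (NaN if none).
matched = df1["string"].str.extract(pattern, expand=False)

# Translate the matched keyword into its category; unmatched rows stay missing.
df1["category"] = matched.map(keyword_to_category)

print(df1)
```

Unlike the loop-based versions, this returns the keyword that occurs leftmost in the string rather than the first keyword in df2 order, so results can differ when a string contains keywords from several categories.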

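I think they want the category, not the keyword. Check their expected output.

@jezrael, that's right. I see many users even downvoted the accepted answer. I've seen plenty of cases like this, with no comment left. I used to like every solution as long as it was unique, and I used to upvote them. Downvoting isn't in my dictionary :-)

@pygo - yes, the best thing is to comment when something is wrong... or downvote, comment, and then upvote once the solution is corrected...

@jezrael, yes indeed, people are investing their valuable time, so they should always be encouraged rather than hit with a silent downvote :-)

I'm really curious about performance. Would it be possible to check the performance of the accepted solution and mine on the real data? Because Elar's solution is the same as mine.

Thank you very much.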