Python: search a string column for multiple substrings and return the substring's category
I have two DataFrames, as follows:
import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea"]})
df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})
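I want an output column on df1 that holds the category whenever at least one keyword from df2 is found in that row's string, and None otherwise. The expected output should look like this: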
id  string           category
01  This is a cat    1
02  That is a dog    1
03  Those are birds  2
04  These are bats   2
05  I drink coffee   3
06  I bought tea     3
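I can think of looping over the keywords one by one and scanning the strings one by one, but that will not be efficient enough as the data grows. Could you suggest any improvements? Thanks, everyone.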
# Modified your data a bit.
df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06", "07"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea",
                               "This won't match squat"]})
You can use a list comprehension with next and a default argument:
df1['category'] = [
    next((c for c, k in df2.values if k in s), None) for s in df1['string']]
df1
   id                  string  category
0  01           This is a cat       1.0
1  02           That is a dog       1.0
2  03         Those are birds       2.0
3  04          These are bats       2.0
4  05          I drink coffee       3.0
5  06            I bought tea       3.0
6  07  This won't match squat       NaN
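To see the next-with-default idiom in isolation, here is a minimal sketch outside pandas. The helper name first_category and the small pairs list are made up for illustration; they stand in for the (category, keyword) rows that df2.values yields.

```python
def first_category(text, pairs, default=None):
    """Return the category of the first keyword found in text, else default."""
    # The generator expression is lazy: next() pulls items only until the
    # first hit, so keywords after the first match are never tested.
    return next((category for category, keyword in pairs if keyword in text), default)


# (category, keyword) pairs, in the same order as the columns of df2
pairs = [(1, "cat"), (1, "dog"), (2, "birds")]

print(first_category("That is a dog", pairs))    # 1
print(first_category("No keyword here", pairs))  # None
```

When the resulting list mixes integers with None, pandas stores the new column as float64, which is why the output above shows 1.0 and NaN rather than 1 and None.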
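You cannot avoid O(N²) complexity, since in the worst case every string is tested against every keyword, but this should still perform reasonably well: the inner loop stops at the first matching keyword, so it only runs through all of them when nothing matches or the match comes last.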
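If regex-based matching is acceptable, one possible variant (not the approach used in this answer) is to build a single alternation pattern from the keywords and let Series.str.extract find a keyword in each row. The sketch below assumes whole-word matching via \b is wanted and uses a shortened copy of the sample data; the names pattern, matched and keyword_to_category are illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03"],
                    "string": ["This is a cat", "Those are birds", "No match here"]})
df2 = pd.DataFrame({"category": [1, 2], "keywords": ["cat", "birds"]})

# One alternation pattern such as r"\b(cat|birds)\b"; re.escape guards
# against keywords that contain regex metacharacters.
pattern = r"\b(" + "|".join(map(re.escape, df2["keywords"])) + r")\b"

# Extract the first keyword that occurs in each string (NaN if there is none),
# then translate the keyword into its category.
keyword_to_category = dict(zip(df2["keywords"], df2["category"]))
matched = df1["string"].str.extract(pattern, expand=False)
df1["category"] = matched.map(keyword_to_category)

print(df1)
```

Unlike the list comprehension, this picks the keyword that appears first in the string rather than the one listed first in df2, and the \b anchors mean a keyword inside a longer word is not counted.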
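Note that the list comprehension with next currently supports only plain substring matching (regex-based matching is not supported, although it could be added with some modification).

Use a list comprehension with split and match against a dictionary created from df2: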
d = dict(zip(df2['keywords'], df2['category']))
df1['cat'] = [next((d[y] for y in x.split() if y in d), None) for x in df1['string']]
print(df1)
   id           string  cat
0  01    This is a cat  1.0
1  02    That is a dog  1.0
2  03  Those are birds  2.0
3  04   These are bats  2.0
4  05   I drink coffee  3.0
5  06    I bought thea  NaN
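The split-based version and the substring version do not accept exactly the same matches. The following sketch contrasts the two on a few made-up strings; the function names by_substring and by_token and the small dictionary are illustrative only.

```python
keyword_to_category = {"cat": 1, "tea": 3}


def by_substring(text):
    # `keyword in text` is true anywhere in the string, even inside a longer word.
    return next((c for kw, c in keyword_to_category.items() if kw in text), None)


def by_token(text):
    # Only whitespace-separated tokens that equal a keyword exactly are counted.
    return next((keyword_to_category[t] for t in text.split()
                 if t in keyword_to_category), None)


for text in ["This is a cat", "Please concatenate these", "I bought tea."]:
    print(f"{text!r}: substring={by_substring(text)}, token={by_token(text)}")
```

The token version does one dictionary lookup per word, so its cost grows with the number of words rather than with the number of keywords, but it misses keywords attached to punctuation ("tea.") unless the tokens are normalized first.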
Another easy-to-understand solution maps a function over df1['string']:
# create a dictionary with keyword->category pairs
cats = dict(zip(df2.keywords, df2.category))

def categorize(s):
    # return the category of the first keyword found in s
    for keyword, category in cats.items():
        if keyword in s:
            return category
    # return 0 in case nothing is found
    return 0

df1['category'] = df1['string'].map(categorize)
print(df1)
   id           string  category
0  01    This is a cat         1
1  02    That is a dog         1
2  03  Those are birds         2
3  04   These are bats         2
4  05   I drink coffee         3
5  06     I bought tea         3
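I think they want the category, not the keyword. Check their expected output.

@jezrael, that's right. I see many users downvote even accepted answers, and I have seen many such cases without any comment. I used to like every solution as long as it was unique, and I'm used to upvoting. Downvoting isn't in my dictionary :-)

@pygo, yes, the best approach is to comment when something is wrong... or to downvote, comment, and then upvote once the solution has been corrected...

@jezrael, yes, indeed. People are putting in their valuable time, so they should always be encouraged rather than downvoted without explanation :-)

I'm really curious about performance. Would it be possible to check the performance of the accepted solution and mine on the real data? Elar's solution works the same way as mine. Thank you very much.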
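The performance question in the last comment could be explored with a small timing harness. This is only a sketch: the function names are made up, and the "big" series is just the six sample rows repeated, a hypothetical stand-in that should be replaced with the real data before drawing any conclusions.

```python
import timeit

import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06"],
                    "string": ["This is a cat", "That is a dog", "Those are birds",
                               "These are bats", "I drink coffee", "I bought tea"]})
df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})


def substring_scan(strings, lookup):
    """Test each keyword as a substring and stop at the first hit."""
    return [next((c for c, k in lookup.values if k in s), None) for s in strings]


def token_lookup(strings, mapping):
    """Split each string and look its tokens up in a dict."""
    return [next((mapping[t] for t in s.split() if t in mapping), None) for s in strings]


mapping = dict(zip(df2["keywords"], df2["category"]))

# Stand-in for real data: the six sample rows repeated 500 times.
big = pd.concat([df1["string"]] * 500, ignore_index=True)

for name, run in [("substring scan", lambda: substring_scan(big, df2)),
                  ("token lookup", lambda: token_lookup(big, mapping))]:
    seconds = timeit.timeit(run, number=3)
    print(f"{name}: {seconds / 3:.4f} s per run")
```

Timings depend heavily on the number of keywords, the string lengths, and how early matches occur, so numbers from repeated sample rows say little about a real workload.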