Python: counting regex matches for compound words in a string
I have a dictionary of regular expressions, keyed by topic, and I want to count how many of each topic's terms match in a string, including compound terms (terms made of several words).
import pandas as pd
terms = {'animals':"(fox|russian brown deer|bald eagle|arctic fox)",
'people':'(John Adams|Rob|Steve|Superman|Super man)',
'games':'(basketball|basket ball|bball)'
}
df=pd.DataFrame({
'Score': [4,6,2,7,8],
'Foo': ['Superman was looking for a russian brown deer.', 'John adams started to play basket ball with rob yesterday before steve called him','Basketball or bball is a sport played by Steve afterschool','The bald eagle flew pass the arctic fox three times','The fox was sptted playing basket ball?']
})
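To count the matches, I can use code similar to that in a related question. But it splits each string on whitespace and then counts the individual tokens, so compound terms are never matched. Is there an alternative that also counts compound terms joined by spaces?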
df1 = df.Foo.str.split(expand=True).stack().reset_index(level=1, drop=True).reset_index(name='Foo')
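for k, v in terms.items():
    # True where a single whitespace-separated token matches one of the alternatives
    df1[k] = df1.Foo.str.contains(r'(?i)(^|\s)' + v + r'($|\s|\.|,|\?)')
# Sum only the boolean topic columns per original row
df2 = df1.groupby('index')[list(terms)].sum().astype(int)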
df = pd.concat([df,df2], axis=1)
print(df)
The final result should look like this:
                                                 Foo  Score  animals  people  games
0     Superman was looking for a russian brown deer.      4        1       1      0
1  John adams started to play basket ball with ro...      6        0       3      1
2  Basketball or bball is a sport played by Steve...      2        0       1      2
3  The bald eagle flew pass the arctic fox three ...      7        3       0      0
4            The fox was sptted playing basket ball?      8        1       0      1
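Note that for row 3, the word fox inside arctic fox and the term arctic fox itself should each be counted once in the animals column (twice together).
See if this is what you are looking for: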
import re
for k in terms.keys():
    df[k] = 0
    # Strip the parentheses, then test each alternative separately
    for words in re.sub("[()]", "", terms[k]).split('|'):
        mask = df.Foo.str.contains(words, case=False)
        df[k] += mask
df
                                                 Foo  Score  people  animals  games
0     Superman was looking for a russian brown deer.      4       1        1      0
1  John adams started to play basket ball with ro...      6       3        0      1
2  Basketball or bball is a sport played by Steve...      2       1        0      2
3  The bald eagle flew pass the arctic fox three ...      7       0        3      0
4            The fox was sptted playing basket ball?      8       0        1      1
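The approach above checks whether each term occurs anywhere in the string, so a short term such as Rob would also match inside a longer word such as Robert. A possible variation, not part of the original answer, is to wrap each term in \b word boundaries and use str.count so that repeated occurrences are counted as well. This is only a sketch under those assumptions; the names topic, pattern, term and bounded are illustrative.

```python
import re
import pandas as pd

terms = {
    'animals': "(fox|russian brown deer|bald eagle|arctic fox)",
    'people': '(John Adams|Rob|Steve|Superman|Super man)',
    'games': '(basketball|basket ball|bball)',
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'Superman was looking for a russian brown deer.',
        'John adams started to play basket ball with rob yesterday before steve called him',
        'Basketball or bball is a sport played by Steve afterschool',
        'The bald eagle flew pass the arctic fox three times',
        'The fox was sptted playing basket ball?',
    ],
})

for topic, pattern in terms.items():
    df[topic] = 0
    # Drop the outer parentheses and handle each alternative on its own
    for term in pattern.strip('()').split('|'):
        # Assumption of this sketch: terms are literal text, so escape them
        # and require a word boundary on both sides
        bounded = r'\b' + re.escape(term) + r'\b'
        df[topic] += df['Foo'].str.count(bounded, flags=re.IGNORECASE)

print(df[['animals', 'people', 'games']])
```

On the sample data this gives the same per-topic counts as the table above, because no term appears inside a longer word or more than once in a row.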
Yes, thank you. I was not very familiar with the sub function in the regex library.
sub stands for substitute. If you leave the parentheses out of the initial terms dictionary, you do not need this sub call.
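To illustrate that last point, here is a minimal sketch that assumes the terms are stored without the surrounding parentheses, so a plain split('|') is enough and re is not imported at all. Passing regex=False is an extra choice made here so that each term is treated as a literal substring; the names topic, pattern and term are illustrative.

```python
import pandas as pd

# Same terms as in the question, but without the outer parentheses
# (an assumption of this sketch), so no re.sub call is needed
terms = {
    'animals': 'fox|russian brown deer|bald eagle|arctic fox',
    'people': 'John Adams|Rob|Steve|Superman|Super man',
    'games': 'basketball|basket ball|bball',
}

df = pd.DataFrame({
    'Score': [4, 6, 2, 7, 8],
    'Foo': [
        'Superman was looking for a russian brown deer.',
        'John adams started to play basket ball with rob yesterday before steve called him',
        'Basketball or bball is a sport played by Steve afterschool',
        'The bald eagle flew pass the arctic fox three times',
        'The fox was sptted playing basket ball?',
    ],
})

for topic, pattern in terms.items():
    df[topic] = 0
    for term in pattern.split('|'):
        # Case-insensitive literal substring test, one term at a time
        mask = df['Foo'].str.contains(term, case=False, regex=False)
        df[topic] += mask.astype(int)

print(df[['animals', 'people', 'games']])
```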