Python exact match: exact matching of lexicon elements in a string

python regex string

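I have a lexicon containing thousands of strings (single words, compound words, hyphenated compound words, and multi-word phrases) and a dataset of text documents. I want to count the lexicon elements that appear as exact matches in each text document.

I tried this:

    lexicon = ['A', 'FOO', 'f']
    instance = 'fA near A AFOO FO ff'

    matches = []
    for word in lexicon:
        if word in instance:
            matches.append(word)

Although the expected result is ['A'], the code above also matches substrings and returns ['A', 'FOO', 'f'].

A second approach, using a regular expression: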

    import re

    matches = []
    for word in lexicon:
        if re.search(r'\b' + word + r'\b', instance):
        # if re.search(r'\b({})\b'.format(word), instance):
            matches.append(word)

Although the list obtained this way is exactly what I need, I get the following error:

      File "<ipython-input-18-5331958cdf85>", line 4, in <module>
        if re.search(r'\b' + word + r'\b', instance):
      File "/opt/anaconda3/lib/python3.7/re.py", line 183, in search
        return _compile(pattern, flags).search(string)
      File "/opt/anaconda3/lib/python3.7/re.py", line 286, in _compile
        p = sre_compile.compile(pattern, flags)
      File "/opt/anaconda3/lib/python3.7/sre_compile.py", line 764, in compile
        p = sre_parse.parse(p, flags)
      File "/opt/anaconda3/lib/python3.7/sre_parse.py", line 938, in parse
        raise source.error("unbalanced parenthesis")
    error: unbalanced parenthesis

文件“”，第4行，在如果重新搜索（r'\b'+word+r'\b'，实例）：文件“/opt/anaconda3/lib/python3.7/re.py”，第183行，搜索中返回编译（模式、标志）。搜索（字符串）文件“/opt/anaconda3/lib/python3.7/re.py”，第286行，在编译中 p=sre_compile.compile（模式、标志）文件“/opt/anaconda3/lib/python3.7/sre_compile.py”，第764行，编译中 p=sre_parse.parse（p，标志）文件“/opt/anaconda3/lib/python3.7/sre_parse.py”，第938行，在parse中 raise source.error（“不平衡括号”）错误：不平衡括号
I don't know how to fix this error, or how to approach the problem in a different way.

Any help would be greatly appreciated.
I think what you are looking for is the number of times words from the lexicon appear as tokens in the document. If that is the case, this should work:

    lexicon = ['A', 'FOO', 'f']
    instance = 'fA near A AFOO FO ff'

    # Split the text on whitespace so only whole tokens are compared
    tokens = set(instance.split())

    matches = []
    for word in lexicon:
        if word in tokens:
            matches.append(word)

    # In this case, matches should equal ['A']
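The problem with your regex version is that some words in the lexicon list may contain special regex characters ((, [, etc.). Escape the words in the lexicon with re.escape and it should work:

    import re

    lexicon = ['A', 'FOO(()))', 'f']
    instance = 'fA near A AFOO FO ff'

    matches = []
    for word in lexicon:
        # re.escape makes every character in word match literally
        if re.search(r'\b' + re.escape(word) + r'\b', instance):
            matches.append(word)

    print(matches)

Prints:

    ['A']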
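
One caveat with \b that may matter for a large lexicon: \b only matches between a word character and a non-word character, so an entry that begins or ends with punctuation will not match when it is followed by a space. The sketch below is one possible workaround, not part of the original answer. It replaces \b with the lookarounds (?<!\w) and (?!\w), and it joins all escaped entries into a single compiled pattern so each document is scanned once. The function names build_lexicon_pattern and count_lexicon_matches and the entry 'C++' are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def build_lexicon_pattern(lexicon):
    """Compile one pattern that matches any lexicon entry as a whole unit."""
    # Longest entries first, so a longer entry wins over a shorter one
    # that starts at the same position (e.g. 'C++' before 'C').
    ordered = sorted(set(lexicon), key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    # (?<!\w) and (?!\w) require that no word character touches the match,
    # which also works for entries that start or end with punctuation.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")


def count_lexicon_matches(pattern, text):
    """Count non-overlapping occurrences of each lexicon entry in text."""
    return Counter(pattern.findall(text))


pattern = build_lexicon_pattern(['A', 'FOO(()))', 'f'])
print(count_lexicon_matches(pattern, 'fA near A AFOO FO ff'))
```

Because findall returns non-overlapping matches, an entry nested inside a longer matched entry is not counted separately. If nested entries must also be counted, searching for each escaped entry on its own, as in the answer above, is the simpler choice.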

You probably need to escape word, since it contains special regex characters. Try re.search(r'\b' + re.escape(word) + r'\b', instance)

@AndrejKesely Flawless, that's it! Thank you! Would you like to post it as an answer so I can accept it?

Yes, you're right. I've posted an answer :)

Thanks for your answer! I don't want to use the split() function, because the lexicon contains elements made up of two or three words, and in that case I wouldn't be able to detect them... though this answer may help others :)
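
Regarding the multi-word limitation raised in the last comment: a token-based approach can still handle entries of two or three words if the text is compared against runs of consecutive tokens (n-grams) instead of single tokens. The following is only a sketch under stated assumptions: both the lexicon entries and the text are split on whitespace, punctuation attached to a word is not stripped, and the function name count_lexicon_ngrams and the entry 'near A' are hypothetical.

```python
from collections import Counter


def count_lexicon_ngrams(lexicon, text):
    """Count lexicon entries that occur as runs of whole tokens in text."""
    tokens = text.split()
    # Key each entry by its tuple of tokens, e.g. 'near A' -> ('near', 'A').
    entries = {tuple(entry.split()): entry for entry in lexicon if entry.split()}
    # Only n-gram lengths that actually occur in the lexicon need scanning.
    lengths = {len(key) for key in entries}

    counts = Counter()
    for n in lengths:
        for i in range(len(tokens) - n + 1):
            key = tuple(tokens[i:i + n])
            if key in entries:
                counts[entries[key]] += 1
    return counts


print(count_lexicon_ngrams(['A', 'FOO', 'f', 'near A'], 'fA near A AFOO FO ff'))
```

Unlike a single combined regex, this counts overlapping entries independently: in the sample text both 'A' and 'near A' are counted once.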