Python regex for tokenization: splitting a word into morphemes or affixes
I am trying to produce a list of a word's components after splitting it into pieces such as suffixes and prefixes (i.e. morphemes or affixes). I tried a regular expression with the
re.findall
function (shown below), but I also need the result to include the segments the pattern does not match. For the example below, the output should be:
['di', 'meth', 'yl', 'amin0', 'eth', 'an', 'ol']
Does anyone know how to extract these segments as a list?

Assuming your final list does not need to preserve the original order, you can find the matched affixes and the unmatched text preceding each of them separately, then combine the two:
import re

affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol']
word = 'dimethylamin0ethanol'

# found = ['di', 'meth', 'yl', 'eth', 'ol']
found = re.findall('|'.join(affixes), word)

# not_found = [('', 'di'), ('', 'meth'), ('', 'yl'), ('amin0', 'eth'), ('an', 'ol')]
not_found = re.findall(r'(.*?)(' + '|'.join(affixes) + ')', word)

# Extract the first item of each tuple in not_found,
# but ONLY when it is not the empty string.
all_items = [pair[0] for pair in not_found if pair[0] != ""] + found
print(all_items)
# all_items = ['amin0', 'an', 'di', 'meth', 'yl', 'eth', 'ol']
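One caveat with this approach, sketched below with a made-up word ending ('zzz' is hypothetical, just for illustration): the lazy prefix group in (.*?)(affix) only captures unmatched text that comes before some affix, so any unmatched text after the last affix is silently dropped. re.split with a capturing group keeps such a tail as a final piece:

```python
import re

affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol']
word = 'dimethylzzz'  # hypothetical word with an unmatched tail

pattern = '(' + '|'.join(affixes) + ')'

# findall with a lazy prefix group never reaches the trailing 'zzz':
pairs = re.findall(r'(.*?)' + pattern, word)
print(pairs)   # [('', 'di'), ('', 'meth'), ('', 'yl')] -- 'zzz' is lost

# re.split with a capturing group keeps it as the last piece:
pieces = [m for m in re.split(pattern, word) if m]
print(pieces)  # ['di', 'meth', 'yl', 'zzz']
```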
Alternatively, re.split with a capturing group keeps the "delimiters" (the matched affixes) in the result, and it preserves the original order; the list comprehension here filters out the empty strings that appear between adjacent affixes:
In [1]: import re
In [2]: affixes = ['meth', 'eth', 'ketone', 'di', 'chloro', 'yl', 'ol']
In [3]: word = 'dimethylamin0ethanol'
In [4]: [match for match in re.split('(' + '|'.join(affixes) + ')', word) if match]
Out[4]: ['di', 'meth', 'yl', 'amin0', 'eth', 'an', 'ol']
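One detail worth noting if you extend the affix list (the 'ethan' affix below is hypothetical, used only to illustrate): Python's re alternation is first-match, not longest-match, so when one affix is a prefix of another, whichever is listed first wins at that position. Sorting the alternatives longest-first avoids the shadowing:

```python
import re

# Hypothetical affix list where 'eth' is a prefix of 'ethan'
affixes = ['eth', 'ethan', 'ol']
word = 'ethanol'

# 'eth' is tried first, so 'ethan' can never match at the same position:
naive = [m for m in re.split('(' + '|'.join(affixes) + ')', word) if m]
print(naive)   # ['eth', 'an', 'ol']

# Sorting longest-first lets the longer affix win:
ordered = sorted(affixes, key=len, reverse=True)
fixed = [m for m in re.split('(' + '|'.join(ordered) + ')', word) if m]
print(fixed)   # ['ethan', 'ol']
```

If your affixes could ever contain regex metacharacters, it is also safer to build the alternation from re.escape(a) for each affix.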