Python look-ahead or retractable iterator
I need to extract all words in a text that meet a certain criterion, for example words that appear in some dictionary:
    some_dict = set()  # initialize from file

    def test1(word):
        return word in some_dict

    def extract1(text):
        return [word for word in text.split() if test1(word)]
Fine, but some dictionary entries consist of several words, 4 at most:
    MAX_DEPTH = 4

    def extract2(text):
        words = text.split()
        return [word for i, word in enumerate(words) if test2(words[i:i + MAX_DEPTH])]

    def test2(words):
        # range() must run to len(words) inclusive, otherwise the full
        # window (and a lone last word) is never tested
        for phrase in (' '.join(words[:i]) for i in range(1, len(words) + 1)):
            if phrase in some_dict:
                return True
        return False
Oh, but I need the whole phrase, not just its first word, so:
    def extract3(text):
        words = text.split()
        res = []
        for i in range(len(words)):
            matched = test3(words[i:i + MAX_DEPTH])
            if matched:
                res.append(matched)
        return res

    def test3(words):
        # range() must run to len(words) inclusive, otherwise the full
        # window (and a lone last word) is never tested
        for phrase in (' '.join(words[:i]) for i in range(1, len(words) + 1)):
            if phrase in some_dict:
                return phrase
        return None
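Fine, but if a multi-word phrase matches, I need to skip over it instead of testing its remaining words, even if they also appear in the dictionary as single words. So I need a retractable iterator. Here is my attempt at implementing one: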
    from copy import copy

    def extract4(text):
        words = text.split()
        res = []
        it = iter(words)
        try:
            while True:
                matched, it = test4(it)
                if matched:
                    res.append(matched)
        except StopIteration:
            pass
        return res

    def test4(it):
        words = [next(it)]  # will raise StopIteration when the list is exhausted
        save = copy(it)
        try:
            for _ in range(MAX_DEPTH):
                phrase = ' '.join(words)
                if phrase in some_dict:
                    return phrase, it  # skip the phrase
                words.append(next(it))
        except StopIteration:
            pass
        return None, save  # retract
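I am a bit worried about the performance impact of creating an iterator copy for every word in the text, since the text can be quite long. More generally, can this be improved in terms of both style and performance?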
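One possible direction, offered as a sketch rather than a definitive answer: keep a bounded look-ahead buffer in a `collections.deque` instead of copying the iterator. The name `extract5` and the dictionary contents are hypothetical. The function accepts any iterable of words, holds at most `MAX_DEPTH` words at a time, and, like `test4`, returns the shortest matching phrase at each position.

```python
from collections import deque
from itertools import islice

MAX_DEPTH = 4
# Hypothetical dictionary contents, for illustration only.
some_dict = {"new york", "york", "pizza"}

def extract5(words):
    """Yield dictionary phrases from any iterable of words, skipping matched words."""
    it = iter(words)
    window = deque()  # look-ahead buffer, never longer than MAX_DEPTH
    while True:
        # Top the buffer up to MAX_DEPTH words (fewer near the end of input).
        window.extend(islice(it, MAX_DEPTH - len(window)))
        if not window:
            return
        for n in range(1, len(window) + 1):
            phrase = ' '.join(islice(window, n))
            if phrase in some_dict:
                for _ in range(n):
                    window.popleft()  # skip the whole matched phrase
                yield phrase
                break
        else:
            window.popleft()  # nothing starts here; advance by one word

result = list(extract5("best new york pizza".split()))
print(result)  # ['new york', 'pizza']
```

Because unmatched words stay in the buffer, nothing has to be rewound, so the input can also be a generator or a file-based word stream, for which `copy()` would not be an option.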
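Edit: a bidirectional iterator has been proposed as a solution, but I would like clients to be able to use standard iterators.
Comment (in reply to a possible-duplicate flag): @Phylogenesis, I have addressed this in my edit.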