Python: getting the string between stop words in a (time-)efficient way

python performance loops

Suppose I have the text:

txt='A single house painted white with a few windows and a nice door in front of the park'

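I want to remove all the leading words if they are stop words, and then take the substring up to the next stop word.

Desired result: single house painted white

I can loop over the list: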

txt = 'A single house painted white with a few windows and a nice door in front of the park'
stopwords = ['a', 'the', 'with', 'this', 'is', 'to', 'etc']  # up to 250 words

words = txt.split()

# Find the position of the first word that is not a stop word.
pos1 = len(words)
for i, word in enumerate(words):
    if word.lower() not in stopwords:
        pos1 = i
        break

rest_text = words[pos1:]
print(rest_text)

# And now we do the same for pos2: the first stop word in the remainder.
pos2 = len(rest_text)
for i, word in enumerate(rest_text):
    if word.lower() in stopwords:
        pos2 = i
        print(word, pos2)
        break

rest_text = rest_text[:pos2]
print(rest_text)

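I have to do this for thousands of texts, so speed matters. Loops are never the way to go in Python, but I can't come up with a list comprehension solution.

Any help?

Note 1: I made the example text longer to make the result clear.
Note 2: Another example: txt='This is a second text to make clear the result I like'

Result: 'second text'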

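I see two ways to improve performance significantly.

set instead of list

The code has to check whether a given string is a member of stopwords. A list is not the best data structure for this, because in the worst case the string has to be compared against every element in the list. Membership testing on a list is O(n).

Sets perform this membership test much faster. In Python they are implemented much like a hash table, which means they can test membership in constant time, O(1), on average. So for a large number of elements, and for this particular operation, a set will significantly outperform a list.

You can make stopwords a set instead of a list like this: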
stopwords = set(['a','the','with','etc'])
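To check the difference on your own machine, a rough timing sketch along these lines may help. The 250 generated words are a made-up stand-in for a real stop-word list, and the absolute numbers will vary from machine to machine.

```python
import timeit

# Hypothetical stand-in for a real stop-word list of about 250 entries.
stopwords_list = ['word{}'.format(i) for i in range(250)]
stopwords_set = set(stopwords_list)

# Worst case for the list: the word is absent, so every element is compared.
probe = 'house'

list_time = timeit.timeit(lambda: probe in stopwords_list, number=100000)
set_time = timeit.timeit(lambda: probe in stopwords_set, number=100000)

print('list: {:.4f}s  set: {:.4f}s'.format(list_time, set_time))
```

Both containers give the same answer for every lookup; only the time per lookup differs.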

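re.finditer instead of str.split()

If your txt is large and you only need the first qualifying substring of txt (as the question implies), you may be able to improve performance by using re.finditer instead of str.split() to separate the words in the text. re.finditer returns an iterator that yields matches lazily, so the whole text does not have to be split up front as it is with str.split(). In the worst case you obviously still have to "loop" over the entire text, but if your match is near the beginning of txt, this can save a lot of time and memory.

For example: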
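import re

txt = 'A single house painted white with a few windows'
stopwords = set(['a', 'the', 'with', 'etc'])

# Lazily yield one whitespace-delimited word at a time.
split_txt = (match.group(0) for match in re.finditer(r'\S+', txt))

result = []
# next(..., None) returns None at the end of the text instead of raising StopIteration.
word = next(split_txt, None)

# Skip the leading stop words.
while word is not None and word.lower() in stopwords:
    word = next(split_txt, None)

# Collect words until the next stop word (or the end of the text).
while word is not None and word.lower() not in stopwords:
    result.append(word)
    word = next(split_txt, None)

print(' '.join(result))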
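If you would rather avoid the explicit while loops, the same skip-then-collect logic can also be written with itertools.dropwhile and itertools.takewhile. This is only a sketch of an alternative and is not guaranteed to be faster; the function name between_stopwords is made up for illustration.

```python
import re
from itertools import dropwhile, takewhile


def between_stopwords(txt, stopwords):
    """Return the words after the leading stop words, up to the next stop word."""
    def is_stop(word):
        return word.lower() in stopwords

    # Lazily yield whitespace-delimited words, one match at a time.
    words = (m.group(0) for m in re.finditer(r'\S+', txt))
    # Drop the leading stop words, then keep words until the next stop word.
    kept = takewhile(lambda w: not is_stop(w), dropwhile(is_stop, words))
    return ' '.join(kept)


stopwords = {'a', 'the', 'with', 'this', 'is', 'to', 'etc'}
print(between_stopwords('A single house painted white with a few windows', stopwords))
print(between_stopwords('This is a second text to make clear the result I like', stopwords))
```

Because both itertools functions are lazy, the scan still stops at the first stop word that follows the result, and a text with no further stop word simply runs to its end.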

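Note, however, that it is usually better to start with some working code and test it against your input than to start optimizing prematurely. Testing will reveal whether optimization is necessary at all. You say in your question that
"Loops are never the way to go in Python"
but that is simply not true. In any language, looping in one form or another is often unavoidable. While the performance may not match compiled languages such as C or Fortran, Python's performance may pleasantly surprise you (if you let it).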
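One way to run such a test is with timeit from the standard library. The sketch below compares a str.split()-based scan with a re.finditer-based one; the helper names with_split and with_finditer and the repeated sample text are made up for illustration, and results will depend on your data.

```python
import re
import timeit

stopwords = {'a', 'the', 'with', 'this', 'is', 'to', 'etc'}


def with_split(txt):
    """Split the whole text first, then scan the resulting list."""
    result = []
    for word in txt.split():
        if word.lower() in stopwords:
            if result:
                break  # first stop word after the collected words
        else:
            result.append(word)
    return ' '.join(result)


def with_finditer(txt):
    """Scan lazily with re.finditer and stop as soon as the result is complete."""
    result = []
    for match in re.finditer(r'\S+', txt):
        word = match.group(0)
        if word.lower() in stopwords:
            if result:
                break
        else:
            result.append(word)
    return ' '.join(result)


# Made-up long input: the match sits near the beginning of a large text.
txt = 'A single house painted white with a few windows ' * 10000

for func in (with_split, with_finditer):
    seconds = timeit.timeit(lambda: func(txt), number=20)
    print('{}: {:.4f}s'.format(func.__name__, seconds))
```

Running this on a sample of your real texts, rather than on the synthetic input used here, gives a better idea of whether either change is worth making.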
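"Loops are never the way to go in Python" - who says that? Anyway, one obvious improvement I can see in your current code is making stopwords a set rather than a list.
a) @Marco Bonelli: I keep reading that list comprehensions are better than loops. But I will run a comparison and post it. b) @supernate: It is indeed that way now. There were a few typos. Thanks for your help.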