Python 3.x: How do I find strings in a text in Python (from a large list of strings)? (Python 3.x, NLTK, N-gram)


python-3.x


I want to find which names from a list appear in a news text.

I have a large text file (about 100 MB) containing many place names. Each name is on its own line in the file.

An excerpt from the file:

Brasiel
Brasier Gap
Brasier Tank
Brasiilia
Brasil
Brasil Colonial

The news text looks like this:

"It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials.
Hanks is not the only celebrity to have tested positive for the virus. British actor Idris Elba also revealed last week he had tested positive."

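For example, in this text the strings Australia and Queensland should be found. I am using the NLTK library and creating n-grams from the news.

To do this, I am doing the following: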

from nltk.util import ngrams

# read the place-name file (one name per line)
with open("top-ord.txt", "r") as file:
    values = file.readlines()  # note: each entry still ends with "\n"

news = "It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials."

# ngrams_list is all ngrams from the news (built elsewhere, not shown)
for item in ngrams_list:
    if item in values:  # scans the whole list for every n-gram
        print(item)

This is too slow. How can I improve it?

Convert the values into a set, like this:

value_set = {country for country in values}

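This will speed up the lookups considerably, because a set membership test runs in constant time on average (as opposed to linear time for a list).

Also, when parsing the file, make sure to strip the trailing newline from each line (if needed); otherwise an entry such as "Brasil\n" will never match the n-gram "Brasil".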
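The two points above (a set for lookups, and stripped lines) can be sketched together as follows. This is only an illustration under some assumptions: `load_names`, `word_ngrams`, and `find_names` are hypothetical helper names, the tokenizing regex is a simple stand-in for whatever tokenization the real pipeline uses, and `word_ngrams` is a plain-Python substitute for `nltk.util.ngrams` that joins the words back into a string so they can be compared with the file entries.

```python
import re

def load_names(lines):
    """Build a set of names from an iterable of lines, dropping newlines and blank lines."""
    return {line.strip() for line in lines if line.strip()}

def word_ngrams(tokens, n):
    """Yield every run of n consecutive tokens, joined by a single space."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def find_names(text, names, max_n=3):
    """Return the n-grams of `text` (1 to max_n words) that occur in `names`."""
    # Assumed tokenization: keep letters, digits, apostrophes and hyphens together,
    # so "Australia," becomes "Australia".
    tokens = re.findall(r"[\w'-]+", text)
    found = []
    for n in range(1, max_n + 1):
        for gram in word_ngrams(tokens, n):
            if gram in names:  # set lookup, constant time on average
                found.append(gram)
    return found

# With the real file this would be something like (encoding is an assumption):
#     with open("top-ord.txt", encoding="utf-8") as f:
#         names = load_names(f)
sample_lines = ["Australia\n", "Queensland\n", "Brasil\n", "Brasier Gap\n"]
names = load_names(sample_lines)

news = ("It's thought the couple may have contracted the Covid-19 virus in the US "
        "or while travelling to Australia, according to Queensland Health officials.")
print(find_names(news, names))  # ['Australia', 'Queensland']
```

The value of `max_n` should be at least the number of words in the longest name you expect to match; 3 here is an arbitrary choice for the sketch.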

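In addition, I would suggest pickling value_set: reading the text file is very slow, so this should noticeably improve startup time.

@MarsilinouZaky I don't understand your comment.

@nz_21 If I need to show the exact position of each value in the news text, how should I handle that?

@B.Montiero That is a separate question, but parse the news text into a list of words and then create a dict that maps each word to its position. If my answer helped with your original question, please click the green check mark so that future readers know where to look.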
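The two suggestions from the comments (caching the set with pickle, and recording where each match occurs) might look roughly like the sketch below. The function names, the cache file path, the UTF-8 encoding, and the tokenizing regex are all assumptions made for illustration. The position lookup also goes one step beyond the comment: instead of mapping single words to word indexes, it maps each matched name (including multi-word names) to its starting character offsets.

```python
import os
import pickle
import re

def load_name_set(txt_path, cache_path):
    """Load the name set from a pickle cache, building the cache from the text file if missing."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    with open(txt_path, encoding="utf-8") as f:
        names = {line.strip() for line in f if line.strip()}
    with open(cache_path, "wb") as f:
        pickle.dump(names, f, protocol=pickle.HIGHEST_PROTOCOL)
    return names

def find_positions(text, names, max_n=3):
    """Map each matched name to the character offsets at which it starts in `text`."""
    # Each token is stored with the offset of its first character.
    tokens = [(m.group(), m.start()) for m in re.finditer(r"[\w'-]+", text)]
    positions = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(word for word, _ in tokens[i:i + n])
            if gram in names:
                positions.setdefault(gram, []).append(tokens[i][1])
    return positions

# Small demonstration with an in-memory set instead of the 100 MB file.
demo_names = {"Australia", "Queensland", "Brasier Gap"}
demo_text = "travelling to Australia, according to Queensland Health officials."
print(find_positions(demo_text, demo_names))
```

Note that this cache is never invalidated: if the place-name file changes, the pickle file has to be deleted by hand. Whether unpickling is actually faster than re-reading the text file is worth measuring on your own data, and a pickle file should only be loaded if you created it yourself.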