The fastest way to search for a word in an unindexed text file - Python


python search


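Consider a text file of about 1.5 million lines, with roughly 50-100 words per line.

To find the lines that contain a given word, using

os.popen('grep -w word infile')

seems to be faster than

for line in infile:
    if word in line:
        print(line)

How else can one search for a word in a text file in Python? What is the fastest way to search a large unindexed text file?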
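
One detail worth keeping in mind when comparing the two approaches: grep -w matches whole words only, while "word in line" is a plain substring test, so the two do not return the same lines. A minimal sketch of a closer pure-Python equivalent, assuming a precompiled regular expression with \b word boundaries is an acceptable stand-in for -w (the function name find_word_lines is made up for this illustration):

```python
import re


def find_word_lines(lines, word):
    """Yield the lines that contain `word` as a whole word."""
    # Compile once, outside the loop; re.escape keeps special characters literal.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    for line in lines:
        if pattern.search(line):
            yield line


if __name__ == "__main__":
    sample = ["some text\n", "awesome\n", "handsome some\n"]
    for hit in find_word_lines(sample, "some"):
        print(hit, end="")
```

An open file object can be passed as lines, since iterating over it yields one line at a time.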

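There are several fast string-search algorithms. They require you to compile the word into some structure first. grep uses one such algorithm.

I haven't looked at the source code for Python's in operator, but either word is compiled again for each line, which takes time (I doubt that in compiles anything; obviously it could compile the word, cache the result, and so on), or the search is inefficient.

Consider searching for "word" in "worword": you first compare "worw" and fail, then you try again starting at "o", then at "r", and fail each time, and so on. But if you are smart, there is no reason to re-check "o" or "r". For example, an algorithm can build a table from the searched word that tells it how many characters can be skipped when a mismatch occurs.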
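
To make the table idea concrete, here is a minimal sketch. The answer does not say which algorithm it has in mind; Knuth-Morris-Pratt is used here as one algorithm that fits the description, and the function names are chosen for this illustration only. A pure-Python loop like this is for understanding the technique and is likely slower than the built-in in operator, which runs in C.

```python
def build_failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter matching prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # reuse what was already matched, never move back in text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


if __name__ == "__main__":
    print(kmp_find("worword", "word"))
```

The text index i only moves forward, so the characters "o" and "r" from the example are never examined a second time.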
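I would suggest installing and using ag (The Silver Searcher), the command-line tool called in the code below.
In my tests it searched a ~1 GB text file with ~29 million lines and found hundreds of occurrences of the search word in under a second (00h 00m 00.73s).
Here is Python 3 code that uses it to search for a word and count how many times it was found: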
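import subprocess

word = "some"
file = "/path/to/some/file.txt"

# -w matches whole words only; -c prints the number of matches instead of the matches
command = ["/usr/local/bin/ag", "-wc", word, file]
# run() waits for ag to finish before its output is read
output = subprocess.run(command, stdout=subprocess.PIPE).stdout
print("Found entries:", output.rstrip().decode('ascii'))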

This version searches for the word and prints the line number and the actual text of each line where the word is found:
import subprocess

word = "some"
file = "/path/to/some/file.txt"

# -w matches whole words only
command = ["/usr/local/bin/ag", "-w", word, file]
process = subprocess.Popen(command, stdout=subprocess.PIPE)

# Stream the output line by line instead of buffering it all with readlines()
for line in process.stdout:
    print(line.rstrip().decode('ascii'))
process.wait()
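
The hard-coded path /usr/local/bin/ag only works on machines where ag is installed in that location. A small sketch of how the command could be built more portably, assuming that falling back to grep -w is acceptable when ag is not on the PATH (build_search_command is a hypothetical helper, not part of the answer above):

```python
import shutil


def build_search_command(word, path):
    """Return an argument list for a whole-word search, or None if no tool is found."""
    ag = shutil.which("ag")
    if ag is not None:
        return [ag, "-w", word, path]
    grep = shutil.which("grep")
    if grep is not None:
        # -w: whole words only, -n: prefix each match with its line number
        return [grep, "-w", "-n", word, path]
    return None


if __name__ == "__main__":
    print(build_search_command("some", "/path/to/some/file.txt"))
```

The returned list can be passed to subprocess.Popen or subprocess.run in place of the command variable.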

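I think using a regular expression could be very fast. But since your file is very large, it may not be possible to load it into RAM for regex analysis. However, the file can be read in large chunks, with the regex applied to each chunk in turn. When doing so, the string being searched for may straddle two chunks and then go undetected, so the chunks have to be analyzed in a way that accounts for this.

I have already written such code and posted it on stackoverflow.com. After a search, I found a post whose code is intended to detect the strings ROW_DEL in a big file and replace them with a shorter string. Your problem is only to detect a pattern, which is simpler. I think you can look at the post I cite to see the way I analyze a text chunk after chunk, and adapt its principle to your more limited need.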
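
The chunk-by-chunk idea can be sketched as follows. This is not the code from the post mentioned above; it is a minimal illustration that assumes a line-oriented file and an ASCII search word, and it handles the chunk-boundary problem by carrying the last incomplete line of each chunk over into the next one. The names search_in_chunks and _matching_lines, and the 1 MiB default chunk size, are arbitrary choices for this example.

```python
import re


def _matching_lines(pattern, block):
    """Yield each line of `block` (bytes) that contains a match, once per line."""
    pos = 0
    while pos < len(block):
        m = pattern.search(block, pos)
        if m is None:
            return
        start = block.rfind(b"\n", 0, m.start()) + 1  # start of the matching line
        end = block.find(b"\n", m.end())              # end of the matching line
        if end == -1:
            end = len(block)
        yield block[start:end]
        pos = end + 1  # continue after this line to avoid duplicates


def search_in_chunks(path, word, chunk_size=1024 * 1024):
    """Yield the lines (as bytes) of the file at `path` containing `word` as a whole word."""
    pattern = re.compile(rb"\b" + re.escape(word.encode("ascii")) + rb"\b")
    leftover = b""  # incomplete last line of the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = leftover + chunk
            cut = data.rfind(b"\n") + 1  # 0 when the data holds no newline yet
            block, leftover = data[:cut], data[cut:]
            yield from _matching_lines(pattern, block)
    # The file may end without a trailing newline
    yield from _matching_lines(pattern, leftover)
```

Because every block handed to the regex ends on a line boundary, a word can never be split between two blocks, at the cost of holding at most one chunk plus one partial line in memory.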