Python regex: re.search() is extremely slow on large text files

python regex performance

My code does the following:

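Take a large text file (e.g. a legal document that is 300 pages long as a PDF)

Find a certain keyword (e.g. "small")

Return n words to the left of the keyword and n words to the right of the keyword

Note: in this context, a "word" is any string of non-whitespace characters. "$cow123" is one word, but "health care" is two words.

Here is my problem: the code takes a very long time to run on the 300 pages, and the time grows rapidly as n increases.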

Here is my code:

import re

with open('test_pdf.txt', mode='r') as fileHandle:
    document = fileHandle.read()

def search(searchText, doc, n):
    # Searches for text, and retrieves n words either side of the text, which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
    return groups[:n], groups[n:]

print(search(r"\$27.5 million", document, 10))

This is the culprit:

surround = r"\s*(\S*)\s*"
groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()

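Here is how to test this code: copy the function definition from the code block above and run the following:

t = "The world is a small place, we $.205% try to take care of it."
print(search(r"\$.205", t, 3))

I suspect I have a nasty case of catastrophic backtracking, but I am too new to regex to put my finger on the problem.

How can I speed up my code?

I think you are going about this completely backwards (I am a little confused about what you are doing in the first place!).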

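I would recommend checking out the re_search function I developed in the textools module of my cloudtb package.

With re_search you could solve this problem with something like:

from cloudtb import textools

data_list = textools.re_search('my match', pdf_text_str)  # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):  # Python 2; use str on Python 3
        words = textools.re_search(r'\w+', regpart)
        # do stuff with words
    else:
        # I think you are ignoring these? Not totally sure
        pass

Here is a link on how to use it and how it works:

In addition, the regular expressions are printed out in a more readable format.

You might also want to check out my tool, or the similar tool Kiki, to help you build and understand regular expressions.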

You could try using mmap together with the appropriate regex flags (untested). But this would only lower memory usage.
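A minimal sketch of what the mmap approach might look like, assuming Python 3, where re accepts a memory-mapped file as a bytes-like object as long as the pattern is itself bytes. The name search_mmap and the example pattern are illustrative assumptions, not code from the answer.

```python
import mmap
import re

def search_mmap(path, pattern):
    """Search a file through a memory map instead of reading it into a str.

    `pattern` must be a bytes pattern, because the mapped file is bytes.
    Returns the matched bytes, or None when nothing matches.
    (search_mmap is a hypothetical helper name.)
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            match = re.search(pattern, mm)
            # Read the group while the map is still open.
            return match.group(0) if match else None
```

It could be called as search_mmap("test_pdf.txt", rb"\$27\.5 million"). The regex engine does the same amount of matching work either way, so this changes memory behaviour rather than search time.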

Another option is to keep a sliding window of words (a simple example would use just one word before and one word after).
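A minimal sketch of such a sliding window, assuming the keyword is a single whitespace-delimited word that must match exactly. The names sliding_windows and find_with_window are hypothetical.

```python
from collections import deque

def sliding_windows(words, size):
    """Yield every run of `size` consecutive words as a tuple."""
    window = deque(maxlen=size)  # the oldest word drops out automatically
    for word in words:
        window.append(word)
        if len(window) == size:
            yield tuple(window)

def find_with_window(text, keyword, n=1):
    """Return (n words before, n words after) the first exact match of keyword.

    Matches closer than n words to either end of the text are not found,
    because no full window is centred on them.
    """
    for window in sliding_windows(text.split(), 2 * n + 1):
        if window[n] == keyword:  # the keyword sits in the middle slot
            return window[:n], window[n + 1:]
    return None
```

Each word is visited once, so the cost grows linearly with the length of the text and does not depend on backtracking. A multi-word search text such as "$27.5 million" would need a wider comparison than the single middle slot used here.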

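Use re.search (or maybe even string.find, if you are only searching for fixed strings) to find the string without any surrounding capture groups. Then use the position and length of the match (.start and .end on the re match object, or the return value of find plus the length of the search string). Take the substring before the match and apply something like /\s*(\S*)\s*\z/ to it, then take the substring after the match and apply something like /\A\s*(\S*)\s*/ to it. (In Python's re module the end-of-string anchor is spelled \Z.)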
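The two-step idea above can be sketched as follows. This version assumes the search text is a fixed string rather than a regex, and it swaps the anchored regexes for str.rsplit and str.split, which cut at whitespace in the same way. The name search_context is hypothetical.

```python
def search_context(search_text, doc, n):
    """Find a fixed string, then collect up to n words on each side of it.

    Returns (words_before, words_after) as lists, or None if the text is absent.
    """
    pos = doc.find(search_text)  # plain substring search, no regex involved
    if pos == -1:
        return None
    if n <= 0:
        return [], []
    before = doc[:pos]
    after = doc[pos + len(search_text):]
    # rsplit/split with maxsplit=n only cut off the n words nearest the match.
    left = before.rsplit(None, n)[-n:]
    right = after.split(None, n)[:n]
    return left, right
```

With the test string from the question, search_context("$.205", t, 3) gives (['small', 'place,', 'we'], ['%', 'try', 'to']), which matches what the original pattern captures. Fewer than n words are returned when the match sits near the start or end of the document.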

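Also, to help with your backtracking: you can use a pattern like \s+\S+\s+ rather than \s*\S*\s* (two chunks of whitespace must be separated by a non-zero amount of non-whitespace, otherwise they would not be two chunks), and you should not butt two consecutive \s* up against each other the way you do. I think r'\S+'.join([r'\s+']*(n)) would give the right pattern for capturing the n preceding words (but my Python is rusty, so check that).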
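Checking that expression is straightforward. The sketch below prints what the join produces and then shows one possible capturing variant; the helper names join_skeleton and words_before are assumptions made for illustration.

```python
import re

def join_skeleton(n):
    """The expression from the paragraph above, evaluated as written."""
    return r"\S+".join([r"\s+"] * n)

def words_before(n):
    """A capturing variant: n words, each followed by mandatory whitespace."""
    return r"(\S+)\s+" * n

print(join_skeleton(3))  # \s+\S+\s+\S+\s+

match = re.search(words_before(3) + "small", "The world is a small place")
print(match.groups())  # ('world', 'is', 'a')
```

As written, the join yields n whitespace runs separated by n - 1 word runs and has no capture groups, so the capturing variant is closer to what the question needs. Both forms make every word and every gap mandatory.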

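I see a couple of problems here. The first, and probably the worst, is that everything in your "surround" regex is not just optional but independently optional. Given this string:

"Lorem ipsum tritani impedit civibus ei pri"

...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match: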

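regex: \s*(\S*)\s*tritani
offset 0: '' 'Lorem' ' ' FAIL
          '' 'Lorem' '' FAIL
          '' 'Lore' '' FAIL
          '' 'Lor' '' FAIL
          '' 'Lo' '' FAIL
          '' 'L' '' FAIL
          '' '' '' FAIL

...then it bumps ahead one position and starts over:

offset 1: '' 'orem' ' ' FAIL
          '' 'orem' '' FAIL
          '' 'ore' '' FAIL
          '' 'or' '' FAIL
          '' 'o' '' FAIL
          '' '' '' FAIL

...and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:

position 5: ' ' 'ipsum' ' ' 'tritani'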

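And that is with only one word to skip over, and with n=1. If you set n=2, you end up with this:

\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*

I'm sure you can see where this is going. In particular, note that when I change it to this:

(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)(tritani)(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)

...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it is not optional, don't treat it as optional.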
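A rough way to see the difference for yourself is to time both styles of pattern on the same text. The sketch below is an assumption-laden illustration: the synthetic document, the function names, and the choice of n are all made up for the comparison, and the timings depend on the machine and Python version.

```python
import re
import timeit

def star_pattern(search_text, n):
    """The question's style: every piece of the surround is optional."""
    surround = r"\s*(\S*)\s*"
    return surround * n + re.escape(search_text) + surround * n

def plus_pattern(search_text, n):
    """Same idea, but words and the gaps between them are mandatory."""
    return r"(\S+)\s+" * n + re.escape(search_text) + r"\s+(\S+)" * n

def context(pattern, doc, n):
    """Return the n captured words on each side of the match."""
    groups = re.search(pattern, doc).groups()
    return groups[:n], groups[n:]

# Synthetic document (hypothetical): filler words with the keyword near the end.
doc = "lorem ipsum " * 100 + "tritani " + "impedit civibus " * 5
n = 3

if __name__ == "__main__":
    for build in (star_pattern, plus_pattern):
        pattern = build("tritani", n)
        seconds = timeit.timeit(lambda: context(pattern, doc, n), number=5)
        print(build.__name__, context(pattern, doc, n), seconds)
```

Both patterns return the same surrounding words on this input, so any gap in the printed timings comes from how much the engine has to backtrack at each starting offset. The + version does require whitespace on both sides of the keyword, so it will not match a keyword glued to other characters, as in the "$.205%" test string.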

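Finally, you may have noticed the auto-generated regex; I think that would make a good question in its own right.

(1) Are you married to regex? (2) Do you intend 'small,' and 'small' to count as different words? (By your definition they are, but maybe you didn't mean that.)

How about first finding the line with the given word (-> simpler regex -> maybe faster)? Once you know its position, you can cheaply retrieve the surrounding words.

DSM: 'small,' returns two entities ('small' and ',').

yedpodtrzitko: I thought that was what my code is currently doing?

No, what your code does is try to find everything at once. How does your search perform if you look only for the word (no groups, no surroundings)?

This is unlikely to help. The data is already in memory, and what takes all the time is re.search. mmap could in theory reduce memory consumption and/or save time on the initial read, but it won't make the regex engine run any faster.

@user1274740 Then the next thing is to try to see if you can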