Speed up millions of regex replacements in Python 3

python regex string performance replace

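I have two lists:

a list of about 750K "sentences" (long strings)
a list of about 20K "words" that I would like to delete from my 750K sentences

So I have to loop through 750K sentences and perform about 20K replacements, but only if my words are actually "words" and not part of a larger string of characters.

I am doing this by pre-compiling my words so that they are flanked by the \b word-boundary metacharacter:

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

Then I loop through my "sentences":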
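The loop itself is not reproduced at this point in the thread. A minimal sketch of the kind of nested loop being described might look like the following; the sample data and the names `sentences`, `pattern` and `cleaned` are assumptions made for illustration, not the asker's actual code.

```python
import re

# Hypothetical sample data standing in for the real 20K words / 750K sentences.
my20000words = ["foo", "bar"]
sentences = ["foo is not foobar", "a bar and a barn"]

compiled_words = [re.compile(r'\b' + word + r'\b') for word in my20000words]

cleaned = []
for sentence in sentences:
    for pattern in compiled_words:
        # Each banned word triggers one full scan of the sentence.
        sentence = pattern.sub("", sentence)
    cleaned.append(sentence)

print(cleaned)
```

The cost is roughly (number of sentences) x (number of words) full scans, which is why the answers focus on cutting down the number of passes.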

This nested loop processes about 50 sentences per second, which is nice, but it still takes several hours to process all of my sentences.

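Is there a way to use the str.replace method (which I believe is faster), but still require that replacements only happen at word boundaries?

Alternatively, is there a way to speed up the re.sub method? I have already improved the speed marginally by skipping re.sub when my word is longer than my sentence, but it's not much of an improvement.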

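I'm using Python 3.5.2.

One thing you can try is pre-processing the sentences to encode the word boundaries. Basically, turn each sentence into a list of words by splitting on word boundaries.
This should be faster, because to process a sentence you only have to step through its words and check whether each one is a match.

Currently the regex search has to go through the entire string again for every word, looking for word boundaries, and then "discards" the result of that work before the next pass.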
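A rough sketch of that idea, assuming the banned words are plain lowercase words held in a set. The names `SPLITTER` and `remove_words` are made up for this example, and splitting on `\W+` is only one possible definition of a word boundary.

```python
import re

# Split on runs of non-word characters. The capturing group keeps the
# separators in the result so the sentence can be reassembled afterwards.
SPLITTER = re.compile(r'(\W+)')

def remove_words(sentence, banned):
    # `banned` is assumed to be a set of lowercase words (no punctuation).
    pieces = SPLITTER.split(sentence)
    # Set membership is O(1) on average, so each sentence is scanned once.
    return ''.join(p for p in pieces if p.lower() not in banned)

banned = {"foo", "bar"}
print(remove_words("Foo is not foobar, bar none", banned))
```

Each sentence is tokenized once, and the cost per token no longer depends on how many banned words there are.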
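One thing you can try is to compile a single pattern such as "\b(word1|word2|word3)\b".
Because re relies on C code to do the actual matching, the savings can be substantial.
As @pvg pointed out in the comments, it also benefits from single-pass matching.

If your words are not regular expressions, Eric's answer is faster.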
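As an illustration, a single union pattern could be built along these lines. The helper name `build_union` is hypothetical, and `re.escape` is added on the assumption that the words are literal strings rather than regex fragments.

```python
import re

def build_union(words):
    # re.escape keeps literal words from being interpreted as regex syntax.
    alternatives = '|'.join(re.escape(w) for w in words)
    # A non-capturing group is enough because the match is only deleted.
    return re.compile(r'\b(?:' + alternatives + r')\b')

pattern = build_union(["foo", "bar", "baz"])
print(pattern.sub("", "foo and foobar and baz"))
```

The whole sentence is then scanned once, instead of once per banned word.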
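Perhaps Python is not the right tool here. Here is one approach with the Unix toolchain:

sed G file | tr ' ' '\n' | grep -vf blacklist | awk -v RS= -v OFS=' ' '{$1=$1}1'

This assumes your blacklist file has been preprocessed with the word boundaries added. The steps are: convert the file to double-spaced, split each sentence into one word per line, mass-delete the blacklisted words from the file, and merge the lines back.
This should run at least an order of magnitude faster.
The blacklist file itself is produced by preprocessing a plain word list (one word per line).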

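Practical approach

The solution described below uses a lot of memory, because it stores all the text in one string in order to reduce the complexity. If RAM is an issue, think twice before using it.

With the join/split trick you can avoid the loops altogether, which should speed up the algorithm.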

Concatenate the sentences with a special delimiter that does not occur in any of the sentences:

merged_sentences = ' * '.join(sentences)

Compile a single regex for all the words you need to remove from the sentences, using the | "or" regex operator:

regex = re.compile(r'\b({})\b'.format('|'.join(words)), re.I) # re.I is a case insensitive flag

Substitute the words with the compiled regex, then split on the special delimiter to get back the separate sentences:

clean_sentences = re.sub(regex, "", merged_sentences).split(' * ')
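Put together, the three steps might look like the sketch below. The sample data, the `DELIMITER` constant, the delimiter check, and the use of `re.escape` are additions for illustration, assuming the words are literal strings.

```python
import re

# Hypothetical sample data.
sentences = ["Foo is here", "nothing to remove", "a bar, a BAR"]
words = ["foo", "bar"]

DELIMITER = ' * '
# The trick only works if the delimiter never occurs in the data.
assert not any(DELIMITER in s for s in sentences)

# One case-insensitive pattern for all words, matched on word boundaries.
regex = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, words))), re.I)

merged_sentences = DELIMITER.join(sentences)
clean_sentences = regex.sub("", merged_sentences).split(DELIMITER)
print(clean_sentences)
```

Since the delimiter consists only of non-word characters, it does not change where \b matches inside the individual sentences.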
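Performance

The complexity of "".join is O(n). This is fairly intuitive, but here is a shortened quotation from the source anyway:

for (i = 0; i < seqlen; i++) { [...] sz += PyUnicode_GET_LENGTH(item);

Therefore, with join/split you have O(words) + 2*O(sentences), which is still linear complexity, versus 2*O(N²) with the initial approach.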

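By the way, don't use multithreading. The task is strictly CPU-bound, so the GIL has no chance to be released and the threads block one another. Each thread still competes for the GIL at every tick, which adds extra work and can make the whole job slower rather than faster.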

def replace4(sentences):
    pd = patterns_dict.get

    def repl(m):
        w = m.group()
        return pd(w.lower(), w)

#!/usr/bin/env python3

from __future__ import unicode_literals, print_function
import re
import time
import io

def replace_sentences_1(sentences, banned_words):
    # faster on CPython, but does not use \b as the word separator
    # so result is slightly different than replace_sentences_2()
    def filter_sentence(sentence):
        words = WORD_SPLITTER.split(sentence)
        words_iter = iter(words)
        for word in words_iter:
            norm_word = word.lower()
            if norm_word not in banned_words:
                yield word
            # yield the word separator; the default keeps StopIteration from
            # escaping the generator after the last word (an error since Python 3.7)
            yield next(words_iter, '')

    WORD_SPLITTER = re.compile(r'(\W+)')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

def replace_sentences_2(sentences, banned_words):
    # slower on CPython, uses \b as separator
    def filter_sentence(sentence):
        boundaries = WORD_BOUNDARY.finditer(sentence)
        current_boundary = 0
        try:
            while True:
                last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                yield sentence[last_word_boundary:current_boundary]  # yield the separators
                last_word_boundary, current_boundary = current_boundary, next(boundaries).start()
                word = sentence[last_word_boundary:current_boundary]
                norm_word = word.lower()
                if norm_word not in banned_words:
                    yield word
        except StopIteration:
            # no boundaries left; end the generator explicitly (required since Python 3.7)
            return

    WORD_BOUNDARY = re.compile(r'\b')
    banned_words = set(banned_words)
    for sentence in sentences:
        yield ''.join(filter_sentence(sentence))

corpus = io.open('corpus2.txt').read()
banned_words = [l.lower() for l in open('banned_words.txt').read().splitlines()]
sentences = corpus.split('. ')
output = io.open('output.txt', 'wb')
print('number of sentences:', len(sentences))
start = time.time()
for sentence in replace_sentences_1(sentences, banned_words):
    output.write(sentence.encode('utf-8'))
    output.write(b' .')
print('time:', time.time() - start)

$ # replace_sentences_1()
$ python3 filter_words.py
number of sentences: 862462
time: 24.46173644065857
$ pypy filter_words.py
number of sentences: 862462
time: 15.9370770454

$ # replace_sentences_2()
$ python3 filter_words.py
number of sentences: 862462
time: 40.2742919921875
$ pypy filter_words.py
number of sentences: 862462
time: 13.1190629005

['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

{
    'f': {
        'o': {
            'o': {
                'x': {
                    'a': {
                        'r': {
                            '': 1
                        }
                    }
                },
                'b': {
                    'a': {
                        'r': {
                            '': 1
                        },
                        'h': {
                            '': 1
                        }
                    }
                },
                'z': {
                    'a': {
                        '': 1,
                        'p': {
                            '': 1
                        }
                    }
                }
            }
        }
    }
}
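The nested dictionary above appears to be a trie of the five words listed before it, with the empty-string key marking the end of a word. A minimal sketch of how such a structure can be built follows; `build_trie` is a name chosen for this example and is not part of the answer's code.

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for char in word:
            # Reuse the child node if this prefix already exists.
            node = node.setdefault(char, {})
        node[''] = 1  # the empty key marks the end of a word
    return root

trie = build_trie(['foobar', 'foobah', 'fooxar', 'foozap', 'fooza'])
print(trie['f']['o']['o']['z'])
```

Words that share a prefix share the same path in the trie, which is what allows the generated regex to factor out common prefixes.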

r"\bfoo(?:ba[hr]|xar|zap?)\b"
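A quick check, assuming the pattern above is meant to cover exactly the five example words, confirms that it matches each of them as a whole word and ignores longer strings that merely contain them. The test sentence is made up for illustration.

```python
import re

pattern = re.compile(r"\bfoo(?:ba[hr]|xar|zap?)\b")
words = ['foobar', 'foobah', 'fooxar', 'foozap', 'fooza']

# Every listed word is matched in full.
all_match = all(pattern.fullmatch(w) for w in words)

# Only whole words are removed; 'foobarn' and a bare 'foo' are left alone.
cleaned = pattern.sub("", "a foobar but not foobarn or foo")
print(all_match, repr(cleaned))
```

Because the shared prefix "foo" is tested only once, the engine can reject a non-matching word after a few characters instead of trying every alternative in a flat union.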

import re


class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1  # the empty key marks the end of a word

    def dump(self):
        return self.data

    def quote(self, char):
        return re.escape(char)

    def _pattern(self, pData):
        data = pData
        # A node holding only the end marker contributes nothing to the pattern.
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []  # alternatives that continue with a sub-pattern
        cc = []   # single characters that end a word (candidates for a character class)
        q = 0     # set when a word may also end at this node (makes the rest optional)
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                recurse = self._pattern(data[char])
                if recurse is None:
                    cc.append(self.quote(char))
                else:
                    alt.append(self.quote(char) + recurse)
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

# Encoding: utf-8
import re
import timeit
import random
from trie import Trie

with open('/usr/share/dict/american-english') as wordbook:
    banned_words = [word.strip().lower() for word in wordbook]
    random.shuffle(banned_words)

test_words = [
    ("Surely not a word", "#surely_NöTäWORD_so_regex_engine_can_return_fast"),
    ("First word", banned_words[0]),
    ("Last word", banned_words[-1]),
    ("Almost a word", "couldbeaword")
]

def trie_regex_from_words(words):
    trie = Trie()
    for word in words:
        trie.add(word)
    return re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)

def find(word):
    def fun():
        return union.match(word)
    return fun

for exp in range(1, 6):
    print("\nTrieRegex of %d words" % 10**exp)
    union = trie_regex_from_words(banned_words[:10**exp])
    for description, test_word in test_words:
        time = timeit.timeit(find(test_word), number=1000) * 1000
        print("  %s : %.1fms" % (description, time))

TrieRegex of 10 words
  Surely not a word : 0.3ms
  First word : 0.4ms
  Last word : 0.5ms
  Almost a word : 0.5ms

TrieRegex of 100 words
  Surely not a word : 0.3ms
  First word : 0.5ms
  Last word : 0.9ms
  Almost a word : 0.6ms

TrieRegex of 1000 words
  Surely not a word : 0.3ms
  First word : 0.7ms
  Last word : 0.9ms
  Almost a word : 1.1ms

TrieRegex of 10000 words
  Surely not a word : 0.1ms
  First word : 1.0ms
  Last word : 1.2ms
  Almost a word : 1.2ms

TrieRegex of 100000 words
  Surely not a word : 0.3ms
  First word : 1.2ms
  Last word : 0.9ms
  Almost a word : 1.6ms