Python: removing regex matches in a loop and continuing with the updated string


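I have a string that I want to run against four word lists: one of four-grams, one of trigrams, one of bigrams, and one of single terms. To avoid counting a word from the single-term list twice when it is also part of one or more trigrams, for example, I start by counting the four-grams and then want to update the string by removing the matches, so that only the remainder of the string is checked for trigram, bigram, and single-term matches in turn. I have used the following code, shown here for the four-grams and trigrams only: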

financial_trigrams_count=0
financial_fourgrams_count=0

strn="thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams=["value to the business", "car and truck sales"]
pattern_trigrams=["cash flow statement", "chief financial officer"]

for i in pattern_fourgrams:
    financial_fourgrams_count=financial_fourgrams_count+strn.count(i)

new_strn=strn
def clean_text1(pattern_fourgrams, new_strn):
    for r in pattern_fourgrams:
        new_strn = re.sub(r, '', new_strn)
    return new_strn

for i in pattern_trigrams:
    financial_trigrams_count=financial_trigrams_count+new_strn.count(i)

new_strn1=new_strn
def clean_text2(pattern_trigrams, new_strn1):
    for r in pattern_trigrams:
        new_strn1 = re.sub(r, '', new_strn1)
    return new_strn1

print(financial_fourgrams_count)
print(financial_trigrams_count)
word_count_wostop=len(strn.split())
print(word_count_wostop)

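For the four-grams there are no matches, so the new string should be identical to the original. There is, however, one trigram match ("chief financial officer"), and I have not managed to remove it from new_strn1. Instead, I get the complete string back, i.e. strn (or the identical new_strn).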

Can anyone help me find the error here?

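You need to remove the def wrappers. The two functions are only defined, never called, so the re.sub calls inside them never run: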

import re

financial_trigrams_count = 0
financial_fourgrams_count = 0

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]

# Count the four-grams in the original string
for i in pattern_fourgrams:
    financial_fourgrams_count += strn.count(i)

# Remove the four-gram matches (no def, so this loop actually runs)
new_strn = strn
for r in pattern_fourgrams:
    new_strn = re.sub(r, '', new_strn)

# Count the trigrams in what is left
for i in pattern_trigrams:
    financial_trigrams_count += new_strn.count(i)

# Remove the trigram matches as well
new_strn1 = new_strn
for r in pattern_trigrams:
    new_strn1 = re.sub(r, '', new_strn1)

print(new_strn1)
print(financial_fourgrams_count)
print(financial_trigrams_count)
# Word count of the original, unmodified string
word_count_wostop = len(strn.split())
print(word_count_wostop)
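
If you would rather keep a helper function, the other option is to call it and assign its return value. The following is only a minimal sketch: the name remove_patterns is hypothetical (it is not in the original code), and re.escape is added on the assumption that the n-grams should be matched as literal text rather than as regex syntax.

```python
import re

def remove_patterns(patterns, text):
    """Return text with every literal occurrence of each pattern removed."""
    for p in patterns:
        # re.escape makes the n-gram match as plain text, not as a regex
        text = re.sub(re.escape(p), '', text)
    return text

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."
pattern_trigrams = ["cash flow statement", "chief financial officer"]

# The function only has an effect if it is called and its result is kept
new_strn1 = remove_patterns(pattern_trigrams, strn)
print(new_strn1)
```

Because Python strings are immutable, the function cannot change strn in place; the cleaned text only exists in the value it returns.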

(As a supplement to Tilak Putta's answer)

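Note that you are searching the string twice: once with .count() to count the occurrences of each n-gram, and once more with re.sub() to remove the matches.

You can improve performance by counting and removing in a single pass. This can be done with re.subn. This function takes the same arguments as re.sub, but returns a tuple containing the cleaned string and the number of substitutions made. For example: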

for r in pattern_fourgrams:
    new_strn, n = re.subn(r, '', new_strn)
    financial_fourgrams_count += n
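
Applied to the whole cascade, from the longest n-grams to the shortest, this might look like the sketch below. The function name count_and_remove and the bigram list are illustrative assumptions, not part of the original post, and re.escape is added so that each n-gram is matched as literal text.

```python
import re

def count_and_remove(patterns, text):
    """Remove each pattern from text; return (cleaned text, total matches)."""
    total = 0
    for p in patterns:
        # re.subn returns the new string and the number of replacements
        text, n = re.subn(re.escape(p), '', text)
        total += n
    return text, total

strn = "thank you, john, and good morning, everyone. with me today is tim, our chief financial officer."

pattern_fourgrams = ["value to the business", "car and truck sales"]
pattern_trigrams = ["cash flow statement", "chief financial officer"]
pattern_bigrams = ["good morning"]  # hypothetical list, for illustration only

remaining = strn
counts = {}
# Longest n-grams first, so shorter ones only see what is left over
for name, patterns in [("fourgrams", pattern_fourgrams),
                       ("trigrams", pattern_trigrams),
                       ("bigrams", pattern_bigrams)]:
    remaining, n = count_and_remove(patterns, remaining)
    counts[name] = n

print(counts)
print(remaining)
```

Each list is scanned once per pattern, and the string passed to the next level already has the longer matches removed.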

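Note that this assumes the n-grams are pairwise disjoint (for a fixed n), i.e. two n-grams should not share a word where they occur in the text. subn removes that word the first time it is matched, so any other n-gram containing that particular word can no longer be found.

Did you notice that the two functions you defined are never called?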
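
The overlap caveat about re.subn can be seen in a small example. The sentence and the two trigrams here are made up for illustration; they share the words "flow statement", so removing the first match hides the second.

```python
import re

text = "the cash flow statement analysis was late"
# Two overlapping trigrams: both contain "flow statement"
trigrams = ["cash flow statement", "flow statement analysis"]

counts = {}
for t in trigrams:
    # Once the first trigram is removed, the second no longer occurs in text
    text, n = re.subn(re.escape(t), '', text)
    counts[t] = n

print(counts)
```

Here the second trigram is counted zero times even though it appears in the original sentence, so the order of the list affects the result whenever n-grams overlap.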