Find and replace a list of words in a file using regular expressions in Python

python regex python-2.7


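I want to print the contents of a file to the terminal and, in the process, highlight any words found in a list, without modifying the original file. Here is an example of the code, which does not work yet: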

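def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    the_file = open(self.filename, 'r')
    file_contents = the_file.read()
    for word in highlight_terms:
        regex = re.compile(
              r'\b'      # Word boundary.
            + word       # Each item in the list.
            + r's{0,1}', # One optional 's' at the end.
            flags=re.IGNORECASE | re.VERBOSE)
        subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
        result = re.sub(regex, subst, file_contents)
    print result
    the_file.close()

highlight_terms = [
    'dog',
    'hedgehog',
    'grue'
]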

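As it stands, only the last item in the list gets highlighted, regardless of what it is or how long the list is. I assume each substitution is performed and then "forgotten" when the next iteration begins. It looks like this:

Grues are known to eat both humans and non-human animals. In poorly lit areas, dogs and hedgehogs are considered delicacies by any affluent grue. Dogs can, however, scare away a grue by barking in musical scales. A hedgehog, on the other hand, must resign itself to its fate of becoming a hotdog fit for a grue king.

But it should look like this:

Grues are known to eat both humans and non-human animals. In poorly lit areas, dogs and hedgehogs are considered delicacies by any affluent grue. Dogs can, however, scare away a grue by barking in musical scales. A hedgehog, on the other hand, must resign itself to its fate of becoming a hotdog fit for a grue king.

How do I stop the other substitutions from being lost?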

You can change the regex to the following:

regex = re.compile(r'\b('+'|'.join(highlight_terms)+r')s?', flags=re.IGNORECASE | re.VERBOSE)  # note the ? instead of {0, 1}. It has the same effect

That way, there is no need for the for loop.

This code takes the list of words and joins them with |. So if your list looks like:

a = ['cat', 'dog', 'mouse'];

the regex will be:

\b(cat|dog|mouse)s?
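The single-pass idea above can be sketched end to end as follows. The sample text and term list are hypothetical, and two details are additions not found in the answer: re.escape, which guards against terms containing regex metacharacters, and a trailing \b, which stops a term from matching the start of a longer word. re.VERBOSE is left out because the pattern contains no whitespace or inline comments. print() with a single argument runs on both Python 2.7 and Python 3.

```python
import re

# Hypothetical term list and sample text, used only for illustration.
highlight_terms = ['dog', 'hedgehog', 'grue']
text = "Dogs and hedgehogs are delicacies to any affluent grue."

# re.escape guards against terms that contain regex metacharacters.
alternation = '|'.join(re.escape(term) for term in highlight_terms)
regex = re.compile(r'\b(?:' + alternation + r')s?\b', flags=re.IGNORECASE)

# \g<0> re-inserts the whole match between the ANSI highlight and reset codes.
subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
highlighted = regex.sub(subst, text)
print(highlighted)
```

Because all terms are matched in one pass, no substitution can overwrite another.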

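You need to reassign file_contents to the substituted string on each pass through the loop. Reassigning file_contents does not change what is in the file: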

def highlight_story(self):
        """Print a line from a file and highlight words in a list."""

        the_file = open(self.filename, 'r')
        file_contents = the_file.read()
        output = ""
        for word in highlight_terms:
            regex = re.compile(
                  r'\b'      # Word boundary.
                + word       # Each item in the list.
                + r's{0,1}', # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            file_contents  = re.sub(regex, subst, file_contents) # reassign to updatedvalue
        print file_contents
        the_file.close()
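As a small illustration of why the reassignment matters: re.sub never modifies the string it is given, it returns a new one. The sample text below is hypothetical, and square brackets stand in for the ANSI codes so the result is easy to read.

```python
import re

original = "A dog met a grue."   # hypothetical sample text
subst = '[' + r'\g<0>' + ']'     # brackets instead of ANSI codes, for readability

# re.sub leaves its input untouched and returns a new string.
once = re.sub(r'\bdog', subst, original)

# Feeding the previous result back in lets the second substitution
# build on the first instead of starting over.
twice = re.sub(r'\bgrue', subst, once)

print(original)  # A dog met a grue.
print(once)      # A [dog] met a grue.
print(twice)     # A [dog] met a [grue].
```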


Opening the file using with is also a better approach, and you can copy the string outside the loop and update the copy inside it:

def highlight_story(self):
    """Print a line from a file and highlight words in a list."""
    with open(self.filename) as the_file:
        file_contents = the_file.read()
        output = file_contents # copy
        for word in highlight_terms:
            regex = re.compile(
                r'\b'  # Word boundary.
                + word  # Each item in the list.
                + r's{0,1}',  # One optional 's' at the end.
                flags=re.IGNORECASE | re.VERBOSE)
            subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
            output = re.sub(regex, subst, output) # update copy
        print output  # the with block closes the file, so no explicit close() is needed
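For trying the with-based version outside a class, here is a minimal standalone sketch. The function name highlight_file, its parameters, and the temporary file are assumptions made for illustration, and re.escape is an addition. The function returns the highlighted text rather than printing it, so the same code runs on Python 2.7 and Python 3.

```python
import os
import re
import tempfile

def highlight_file(path, terms):
    """Return the file's text with every term wrapped in ANSI highlight codes."""
    with open(path) as the_file:   # closed automatically when the block ends
        output = the_file.read()
    for word in terms:
        regex = re.compile(r'\b' + re.escape(word) + r's?', flags=re.IGNORECASE)
        # Each pass works on the result of the previous pass.
        output = regex.sub('\033[1;41m' + r'\g<0>' + '\033[0m', output)
    return output

# Usage with a throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Dogs fear the grue.\n")
try:
    result = highlight_file(tmp.name, ['dog', 'grue'])
    with open(tmp.name) as check:
        on_disk = check.read()     # the file itself is never rewritten
    print(result)
finally:
    os.remove(tmp.name)
```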


The regex you provided is correct; the for loop is where it goes wrong.

result = re.sub(regex, subst, file_contents)

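This line replaces every match of regex in file_contents with subst.

On the second iteration it again performs the substitution on file_contents, rather than on result as you intended.

How to fix it:

result = file_contents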
for word in highlight_terms:
    regex = re.compile(
          r'\b'      # Word boundary.
        + word       # Each item in the list.
        + r's?\b', # One optional 's' at the end.
        flags=re.IGNORECASE | re.VERBOSE)
    print regex.pattern
    subst = '\033[1;41m' + r'\g<0>' + '\033[0m'
    result = re.sub(regex, subst, result) #change made here

print result
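To see the difference side by side, here is a small comparison of the two loops on a hypothetical sentence. Angle brackets stand in for the ANSI codes, and the word "dogma" is included to show what the trailing \b does.

```python
import re

text = "The dog saw a hedgehog and a dogma-loving grue."  # hypothetical sample
terms = ['dog', 'hedgehog', 'grue']
subst = '<' + r'\g<0>' + '>'   # angle brackets stand in for the ANSI codes

# Buggy loop: every pass starts again from the untouched text,
# so only the last term's substitution survives.
for word in terms:
    buggy = re.sub(r'\b' + word + r's?\b', subst, text, flags=re.IGNORECASE)

# Fixed loop: each pass works on the previous pass's result.
fixed = text
for word in terms:
    fixed = re.sub(r'\b' + word + r's?\b', subst, fixed, flags=re.IGNORECASE)

print(buggy)  # The dog saw a hedgehog and a dogma-loving <grue>.
print(fixed)  # The <dog> saw a <hedgehog> and a dogma-loving <grue>.
```

The trailing \b keeps "dog" from matching the beginning of "dogma"; without it, that word would be partly highlighted as well.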

@Christopherry, no problem. Glad I could help.