Python: removing only specific full stops from text fails unexpectedly

python python-3.x regex class oop

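I need to preprocess transcripts in two different file formats, namely SRT and WebVTT files. My goal is to remove punctuation from the text lines, but not from the timestamps. Because the timestamps in WebVTT files contain full stops instead of commas (unlike SRT files), the preprocessing has to differ when it comes to removing full stops.

Full stops inside the timestamps must stay untouched, whereas full stops in the text lines should be removed.

The input file looks like this: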

00:00:09.761 --> 00:00:13.864
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:14.340 --> 00:00:23.670
Its barren plateaus, rocky peaks and shifting sands envelop the northern third of Africa, which sees very little rain, vegetation and life.


00:00:24.440 --> 00:00:29.100
Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.

Here is my code:

import re

class Prep:
    def __init__(self, transcript_filename):
        self.transcript_filename = transcript_filename
        self.transcript = self.read_file()
    
    def read_file(self):
        # A context manager closes the file even if reading raises.
        with open(self.transcript_filename, "r") as f:
            return f.read()
    
    def preprocessing(self):
        # Remove noisy punctuation from the transcript.
        prep_transcript = self.transcript.replace("'", '')
        prep_transcript = prep_transcript.replace(';', '')
        prep_transcript = prep_transcript.replace('!', '')
        prep_transcript = prep_transcript.replace('?', '')
        prep_transcript = re.sub(r",\D\b", " ", prep_transcript,
                                 flags=re.MULTILINE)
        prep_transcript = re.sub(r",\n", "\n", prep_transcript,
                                 flags=re.MULTILINE)
        # Handle full stops differently in .vtt and .srt files, because
        # their timestamps are structured differently.
        if self.transcript_filename.endswith(".vtt"):
            pattern = re.compile(r"\d{2}\.\d{3}")
            if pattern.search(prep_transcript):
                pass
            else:
                prep_transcript = prep_transcript.replace('.', '')
        elif self.transcript_filename.endswith(".srt"):
            prep_transcript = prep_transcript.replace('.', '')

        return prep_transcript
    
inst = Prep("sample_transcript.vtt")
print(inst.preprocessing())

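On SRT transcript files, the preprocessing steps above work fine. On WebVTT files, however, they only work for commas, question marks and so on. For whatever reason they do not work for full stops, which are still present in the output: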

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth.


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life.


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest.

Instead, the output should look like this:

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest

Can anyone tell me what I am doing wrong?
I would really appreciate your help and hints.
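You can shorten the first 4 replace statements under the comment "# Remove noisy punctuation from the transcript." by using re.sub with a single character class, [';!?].

To keep the dots in the timestamps, you can, for example, match a dot only when it is not directly followed by a digit: \.(?!\d)

Since all of these statements replace their matches with an empty string, you can combine them using an alternation, |.

The updated line could look like this: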

# Remove noisy punctuation from the transcript.
prep_transcript = re.sub(r"[';!?]|\.(?!\d)", '', self.transcript)
Output

00:00:07.318 --> 00:00:15.654
The Sahara Desert is one of the least hospitable climates on Earth


00:00:17.310 --> 00:00:25.679
Its barren plateaus rocky peaks and shifting sands envelop the northern third of Africa which sees very little rain vegetation and life


00:00:26.440 --> 00:00:29.100
Meanwhile across the Atlantic Ocean thrives the worlds largest rainforest
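
For context on why the question's .vtt branch never removes anything: pattern.search(prep_transcript) scans the whole transcript, and a WebVTT transcript always contains at least one timestamp, so the pass branch is taken every time and replace('.', '') is never reached. Below is a minimal sketch of the lookahead approach. The function name remove_punctuation and the inline SAMPLE string are hypothetical stand-ins for the class method and the input file, not part of the original code.

```python
import re

# Hypothetical inline sample standing in for the contents of a .vtt file.
SAMPLE = (
    "00:00:09.761 --> 00:00:13.864\n"
    "The Sahara Desert is one of the least hospitable climates on Earth.\n"
    "\n"
    "00:00:24.440 --> 00:00:29.100\n"
    "Meanwhile, across the Atlantic Ocean, thrives the world's largest rainforest.\n"
)

# The check from the question: search() inspects the whole string, so it
# succeeds as soon as any timestamp is present and the else branch never runs.
timestamp = re.compile(r"\d{2}\.\d{3}")
always_matches = timestamp.search(SAMPLE) is not None


def remove_punctuation(text):
    """Remove ' ; ! ? everywhere, and '.' unless a digit directly follows it."""
    return re.sub(r"[';!?]|\.(?!\d)", "", text)


cleaned = remove_punctuation(SAMPLE)
print(cleaned)
```

Commas are left alone here because the question handles them in separate statements. One consequence of the lookahead is that a full stop in a text line that is directly followed by a digit (as in "3.5") is kept as well.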

Please provide the expected minimal, reproducible example. Show where the intermediate results deviate from the ones you expect. We should be able to paste a single block of your code into a file, run it, and reproduce your problem. This also lets us test any suggestions in your context; replace it with a simple, internal data set.

@Prune: Unfortunately, I am not allowed to show the actual data set. But I have now edited my question so that the difference between my intermediate result and the desired output should be clear.

Again, please refer to the posting guidelines. You provide dummy data. You provide working code that reproduces the error.

@Prune: I have now adjusted the data and the code.

Nice, just add "," to r"[';!?]|\.(?!\d)" so that it becomes r"[',;!?]|\.(?!\d)"
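
A short sketch of the comma suggestion from the last comment. With the comma inside the character class, commas in SRT timestamps (00:00:14,340) would be removed too, so that variant only suits WebVTT input. The second pattern, which applies the same lookahead to commas, is my own assumption and does not come from the thread; the names VTT_PATTERN and SRT_SAFE_PATTERN and the sample strings are likewise hypothetical.

```python
import re

# Variant from the comment: comma added to the character class (WebVTT only).
VTT_PATTERN = re.compile(r"[',;!?]|\.(?!\d)")

# Assumption, not from the thread: keep '.' and ',' when a digit follows,
# so both timestamp styles survive.
SRT_SAFE_PATTERN = re.compile(r"[';!?]|[.,](?!\d)")

vtt = "00:00:14.340 --> 00:00:23.670\nrocky peaks, shifting sands.\n"
srt = "00:00:14,340 --> 00:00:23,670\nrocky peaks, shifting sands.\n"

vtt_clean = VTT_PATTERN.sub("", vtt)
srt_clean = SRT_SAFE_PATTERN.sub("", srt)
srt_broken = VTT_PATTERN.sub("", srt)  # shows why the first variant is WebVTT-only

print(vtt_clean)
print(srt_clean)
print(srt_broken)
```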