Python: How to extract text within a given time range

python regex

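I have the text below. How can I extract the captions that fall within a given time range? The code that follows extracts all of the values.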

s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a

00:00:17,039 --> 00:00:28,470
will come from an unexpected place

00:00:18,039 --> 00:00:19,470

00:00:20,039 --> 00:00:21,470

00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give

00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it

00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''

import re

lines = s.split('\n')
captions = {}       # maps each timestamp line to the caption text that follows it
current_key = None  # most recently seen timestamp line

for line in lines:
    # A timestamp line looks like "00:00:14,099 --> 00:00:19,100".
    is_key_match_obj = re.search(r'([\d:,]{12})(\s-->\s)([\d:,]{12})', line)
    if is_key_match_obj:
        current_key = is_key_match_obj.group()
        continue

    if current_key:
        if current_key in captions:
            if not line:
                captions[current_key] += '\n'
            else:
                captions[current_key] += line
        else:
            captions[current_key] = line

print(captions.values())

How can I extract only the text between 00:00:17,039 --> 00:00:28,470 and 00:00:30,119?

Expected output for the range from 00:00:17,039 --> 00:00:28,470 to 00:00:30,119 --> 00:00:35,430:

dict_values(['will come from an unexpected place ', '', '', "binary numbers first I'm going to give", ' puzzle and then you can try to solve it'])

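This will print:

"a classic math problem a\n\nwill come from an unexpected place\n\n \n\nbinary numbers first I'm going to give\n\npuzzle and then you can try to solve it\n\nlike I said you have a thousand bottles"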

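Explanation

\d{2}[:,\d]

matches two digits followed by a colon, a comma, or a digit; this captures the start and end timestamps.

[ \n]

matches the space after the start timestamp and the newline after the end timestamp.

(-->)*

matches zero or more occurrences of -->.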
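To see how the fragments explained above behave together, here is a minimal sketch. The combined pattern is an assumption assembled from those fragments (the `+` quantifier and the optional trailing space are additions for illustration), and `sample` is a shortened, hypothetical input in the same format as the question's text.

```python
import re

# Shortened, hypothetical sample in the same format as the question's input.
sample = (
    "00:00:14,099 --> 00:00:19,100\n"
    "a classic math problem a\n"
    "\n"
    "00:00:17,039 --> 00:00:28,470\n"
    "will come from an unexpected place"
)

# Assumed combination of the explained fragments:
#   \d{2}[:,\d]+  two digits, then more digits, colons or commas (one timestamp)
#   [ \n]         the space after the start time, or the newline after the end time
#   (?:-->)*      zero or more literal arrows
#   [ ]?          optional space after the arrow (added for this sketch)
timestamp_pattern = re.compile(r"\d{2}[:,\d]+[ \n](?:-->)*[ ]?")

# Removing every timestamp leaves only the caption text.
text_only = timestamp_pattern.sub("", sample)
print(repr(text_only))
# 'a classic math problem a\n\nwill come from an unexpected place'
```

One caveat with a pattern this loose: it would also remove a number such as 1000 followed by a space inside the caption text, because two digits followed by more digits and a space satisfy it as well.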

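As a few people suggested in the comments, you may want to look at a parser that does this for you by building a parse tree. Parsers are more foolproof than a single regular expression. A Google search led me to this one.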
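As a rough illustration of the parsing approach, the sketch below uses only the standard library rather than any particular third-party parser. The helper names `parse_cues` and `captions_between` are hypothetical, and the code assumes the input has no cue sequence numbers, as in the question's sample.

```python
import re
from datetime import timedelta

s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a

00:00:17,039 --> 00:00:28,470
will come from an unexpected place

00:00:18,039 --> 00:00:19,470

00:00:20,039 --> 00:00:21,470

00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give

00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it

00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''

TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
CUE_LINE = re.compile(TIME + r"\s+-->\s+" + TIME)


def to_timedelta(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields into a timedelta."""
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), milliseconds=int(millis))


def parse_cues(text):
    """Return a list of (start, end, caption) tuples (hypothetical helper)."""
    cues = []
    for line in text.splitlines():
        line = line.strip()
        match = CUE_LINE.fullmatch(line)
        if match:
            fields = match.groups()
            cues.append((to_timedelta(*fields[:4]), to_timedelta(*fields[4:]), []))
        elif line and cues:
            # Any non-empty, non-timestamp line belongs to the latest cue.
            cues[-1][2].append(line)
    return [(start, end, " ".join(parts)) for start, end, parts in cues]


def captions_between(cues, first_start, last_start):
    """Captions of cues whose start time lies in [first_start, last_start]."""
    return [caption for start, _end, caption in cues
            if first_start <= start <= last_start]


cues = parse_cues(s)
selected = captions_between(
    cues,
    timedelta(seconds=17, milliseconds=39),
    timedelta(seconds=30, milliseconds=119),
)
print(selected)
# ['will come from an unexpected place', '', '', "binary numbers first I'm going to give", 'puzzle and then you can try to solve it']
```

Comparing `timedelta` values instead of strings means the range does not have to coincide with an exact timestamp line; any start and end time can be used as bounds.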

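There is no need to iterate line by line. Try the code below; it gives you the dictionary you want.

import re

# Each match pairs a timestamp line with the line that follows it.
captions = dict(re.findall(r'(\d{2}:\d{2}.*)\n(.*)', s))
print(captions.values())

Output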

dict_values(['a classic math problem a', 'will come from an unexpected place', '', '', "binary numbers first I'm going to give", 'puzzle and then you can try to solve it', 'like I said you have a thousand bottles'])
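The dictionary above still contains every caption. One possible way to narrow it to a range is sketched below, relying on the fact that dictionaries preserve insertion order in Python 3.7 and later. The helper `slice_by_keys` is hypothetical, and `sample` is a shortened version of the question's text.

```python
import re

# Shortened, hypothetical sample in the same format as the question's input.
sample = (
    "00:00:14,099 --> 00:00:19,100\na classic math problem a\n\n"
    "00:00:17,039 --> 00:00:28,470\nwill come from an unexpected place\n\n"
    "00:00:18,039 --> 00:00:19,470\n\n"
    "00:00:30,119 --> 00:00:35,430\npuzzle and then you can try to solve it\n\n"
    "00:00:32,489 --> 00:00:37,170\nlike I said you have a thousand bottles"
)

# Same findall approach as above: timestamp line -> the line that follows it.
captions = dict(re.findall(r"(\d{2}:\d{2}.*)\n(.*)", sample))


def slice_by_keys(mapping, first_key, last_key):
    """Values from first_key through last_key (inclusive), in insertion order."""
    keys = list(mapping)
    start = keys.index(first_key)      # raises ValueError if the key is absent
    stop = keys.index(last_key) + 1
    return [mapping[key] for key in keys[start:stop]]


selected = slice_by_keys(
    captions,
    "00:00:17,039 --> 00:00:28,470",
    "00:00:30,119 --> 00:00:35,430",
)
print(selected)
# ['will come from an unexpected place', '', 'puzzle and then you can try to solve it']
```

This only works when both boundary lines appear verbatim in the text; for arbitrary time bounds, the timestamps would need to be parsed and compared as times instead.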

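Can you post the expected output? Like: ?

I would rather use a library to parse these SRT files. There are many libraries available.

@Rakesh I have pasted the expected output.

Is the time range fixed or dynamic? Where does it come from?

In the expected output you have "puzzle and then…", but that text comes after 00:00:30,119 --> 00:00:35,430. Is it a typo?

Can you explain the regex?

Added an explanation.

[:,\d{3}]+ is a character class; it matches one or more of :, a comma, \d, {, 3, or }. [ ^\n] matches a space, a caret, or a newline. Have a look at an introduction to regex.

They don't want to get all of the text, only the text between 00:00:17,039 --> 00:00:28,470 and 00:00:30,119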
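The point made in the comments about character classes can be checked directly. The snippet below is a small sketch; the test strings are made up for illustration.

```python
import re

# Inside [...], the characters {, 3 and } are literals, not a quantifier.
char_class = re.compile(r"[:,\d{3}]+")
braces_match = char_class.fullmatch("{3}") is not None
print(braces_match)
# True

# To repeat the class exactly three times, the quantifier goes outside it.
quantified = re.compile(r"[:,\d]{3}")
chunks = quantified.findall("00:00:14,099")
print(chunks)
# ['00:', '00:', '14,', '099']

# A caret that is not the first character in the class is a literal caret.
literal_caret = re.compile(r"[ ^\n]")
caret_hits = literal_caret.findall("a ^b\nc")
print(caret_hits)
# [' ', '^', '\n']

# A leading caret negates the class: anything except a newline.
negated = re.compile(r"[^\n]")
negated_hits = negated.findall("a\nb")
print(negated_hits)
# ['a', 'b']
```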