Python: how to extract text within a given time range

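I have the text below. How can I extract only the text that falls within a given time range? The code I have so far extracts all of the values.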

s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a

00:00:17,039 --> 00:00:28,470
will come from an unexpected place

00:00:18,039 --> 00:00:19,470

00:00:20,039 --> 00:00:21,470

00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give

00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it

00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''
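import re

lines = s.split('\n')
subtitles = {}      # maps each timestamp line to the text that follows it
current_key = None  # timestamp line of the cue currently being read

for line in lines:
    # A cue header looks like "00:00:14,099 --> 00:00:19,100".
    is_key_match_obj = re.search(r'([\d:,]{12})(\s-->\s)([\d:,]{12})', line)
    if is_key_match_obj:
        current_key = is_key_match_obj.group()
        print(current_key)
        continue

    if current_key:
        if current_key in subtitles:
            if not line:
                subtitles[current_key] += '\n'
            else:
                subtitles[current_key] += line
        else:
            subtitles[current_key] = line

print(subtitles.values())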
How can I extract only the text from
00:00:17,039 --> 00:00:28,470
to
00:00:30,119

Expected output for the range 00:00:17,039 --> 00:00:28,470 to 00:00:30,119 --> 00:00:35,430:

dict_values(['will come from an unexpected place ', '', '', "binary numbers first I'm going to give", ' puzzle and then you can try to solve it'])
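This will print:

"a classic math problem a\n\nwill come from an unexpected place\n\n \n\nbinary numbers first I'm going to give\n\npuzzle and then you can try to solve it\n\nlike I said you have a thousand bottles"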

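Explanation

\d{2}[:,\d]
: matches two digits followed by a colon, a comma, or a digit. This is what captures the start and end timestamps.

[ \n]
: matches the space after the first timestamp and the newline after the end timestamp.

(-->)*
: matches zero or more occurrences of -->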
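To make the behaviour of these character classes concrete, here is a small sketch. The `loose` pattern assumes the class is repeated with `+`, and the `strict` pattern is an alternative introduced here for comparison; neither is taken from the answer itself.

```python
import re

# Assumption: the character class from the explanation, repeated with "+".
loose = re.compile(r'\d{2}[:,\d]+')
# Assumed stricter alternative that spells out HH:MM:SS,mmm.
strict = re.compile(r'\d{2}:\d{2}:\d{2},\d{3}')

header = '00:00:14,099 --> 00:00:19,100'

# Both patterns find the two timestamps in a cue header.
loose_hits = loose.findall(header)
strict_hits = strict.findall(header)
print(loose_hits)
print(strict_hits)

# A character class does not enforce order, so the loose pattern
# also accepts strings that are not timestamps at all.
loose_accepts_junk = bool(loose.fullmatch('12,,::99'))
strict_accepts_junk = bool(strict.fullmatch('12,,::99'))
print(loose_accepts_junk, strict_accepts_junk)
```

The loose form is fine for well-formed subtitle files; the strict form is safer if the text lines themselves may contain digits and punctuation.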


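As others have suggested in the comments, you may want to look at a parser that does this for you by building a parse tree. Parsers are more foolproof than a regex. A quick Google search led me to one.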
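If pulling in a library is not an option, a small hand-rolled parser is one way to approximate the idea. The sketch below is not any particular library's API: `parse_cues` and `cues_starting_between` are hypothetical helpers, and selecting cues by their start time is an assumption about what "within a time range" should mean.

```python
import re
from datetime import timedelta

s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a

00:00:17,039 --> 00:00:28,470
will come from an unexpected place

00:00:18,039 --> 00:00:19,470

00:00:20,039 --> 00:00:21,470

00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give

00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it

00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''

CUE_RE = re.compile(
    r'(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})'
)

def parse_cues(text):
    """Parse SRT-like text into (start, end, text) tuples (hypothetical helper)."""
    cues = []
    for line in text.split('\n'):
        line = line.strip()
        match = CUE_RE.fullmatch(line)
        if match:
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
            start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
            end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
            cues.append([start, end, []])
        elif cues and line:
            # Non-empty line after a header: part of the current cue's text.
            cues[-1][2].append(line)
    return [(start, end, ' '.join(body)) for start, end, body in cues]

def cues_starting_between(cues, first, last):
    """Return the text of every cue whose start time lies in [first, last]."""
    return [text for start, _, text in cues if first <= start <= last]

window = cues_starting_between(
    parse_cues(s),
    timedelta(seconds=17, milliseconds=39),
    timedelta(seconds=30, milliseconds=119),
)
print(window)
```

Because the timestamps become `timedelta` values, the range does not have to coincide with an existing cue boundary; any pair of times can be used.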

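There is no need to iterate line by line. Try the code below. It gives you the dictionary you want.

import re

# Pair each timestamp line with the single line of text that follows it.
subtitles = dict(re.findall(r'(\d{2}:\d{2}.*)\n(.*)', s))
print(subtitles.values())
Output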

dict_values(['a classic math problem a', 'will come from an unexpected place', '', '', "binary numbers first I'm going to give", 'puzzle and then you can try to solve it', 'like I said you have a thousand bottles'])
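The snippet above returns every value. To keep only the cues in the requested range, one option is to slice the list of pairs by header. This is a minimal sketch: `extract_range` is a hypothetical helper, and it assumes both headers appear verbatim in the text and that the range is inclusive.

```python
import re

s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a

00:00:17,039 --> 00:00:28,470
will come from an unexpected place

00:00:18,039 --> 00:00:19,470

00:00:20,039 --> 00:00:21,470

00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give

00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it

00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''

def extract_range(text, first_cue, last_cue):
    """Return the text of every cue from first_cue through last_cue (hypothetical helper)."""
    # Each pair is (timestamp line, the one text line after it), in file order.
    pairs = re.findall(r'(\d{2}:\d{2}.*)\n(.*)', text)
    headers = [header for header, _ in pairs]
    start = headers.index(first_cue)  # raises ValueError if the header is absent
    end = headers.index(last_cue)
    return [body for _, body in pairs[start:end + 1]]

selected = extract_range(
    s,
    '00:00:17,039 --> 00:00:28,470',
    '00:00:30,119 --> 00:00:35,430',
)
print(selected)
```

Like the one-liner it builds on, this only captures one line of text per cue, so multi-line subtitles would need a different pattern.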

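Can you post the expected output?

I would rather use a library to parse these SRT files. There are many available.

@Rakesh I have pasted the expected output.

Is the time range fixed or dynamic? Where does it come from?

In the expected output you have "puzzle and then ...", but that text comes after 00:00:30,119 --> 00:00:35,430. Is that a typo?

Can you explain the regex?

Added an explanation.

[:,\d{3}]+ is a character class that matches one or more of :, ,, \d, {, or }. [ ^\n] matches a space, a caret, or a newline. Have a look at an introduction to regex.

They don't want all of the text, only the part between
00:00:17,039 --> 00:00:28,470
00:00:30,119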