Python: how to extract text within a given time range
Tags: python, regex
I have the text below. How can I extract the text that falls within a time range? The code below extracts all values.
s = '''00:00:14,099 --> 00:00:19,100
a classic math problem a
00:00:17,039 --> 00:00:28,470
will come from an unexpected place
00:00:18,039 --> 00:00:19,470
00:00:20,039 --> 00:00:21,470
00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give
00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it
00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles'''
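import re

lines = s.split('\n')
captions = {}        # timestamp line -> caption text (avoids shadowing the built-in dict)
current_key = None   # most recent timestamp line seen
for line in lines:
    # A key line looks like "00:00:14,099 --> 00:00:19,100".
    is_key_match_obj = re.search(r'([\d:,]{12})(\s-->\s)([\d:,]{12})', line)
    if is_key_match_obj:
        current_key = is_key_match_obj.group()
        print(current_key)
        continue
    if current_key:
        if current_key in captions:
            if not line:
                captions[current_key] += '\n'
            else:
                captions[current_key] += line
        else:
            captions[current_key] = line
print(captions.values())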
How can I extract only the text from 00:00:17,039 --> 00:00:28,470
and 00:00:30,119
Expected output, from 00:00:17,039 --> 00:00:28,470
to 00:00:30,119 --> 00:00:35,430:
dict_values(['will come from an unexpected place ', '', '', "binary numbers first I'm going to give", ' puzzle and then you can try to solve it'])
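This prints:
"a classic math problem a\n\nwill come from an unexpected place\n\n\n\nbinary numbers first I'm going to give\n\npuzzle and then you can try to solve it\n\nlike I said you have a thousand bottles"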
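Explanation
\d{2}[:,\d]
matches two digits followed by a colon, a comma, or a digit; this matches the start and end timestamps.
[ \n]
matches the space after the first timestamp and the newline after the end timestamp.
(-->)*
matches zero or more occurrences of -->.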
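The code that this explanation belongs to is not shown, so the following is only a sketch of how the three pieces might fit together. It assumes they are concatenated into one pattern, that the character class carries a + quantifier so it can consume a whole timestamp, and that the pattern is used with re.sub to strip timestamps. The names TIMESTAMP, sample, cleaned, and caption_lines are made up for illustration.

```python
import re

# Assumed composition of the three pieces described above.
# The "+" after the character class is an assumption.
TIMESTAMP = re.compile(r'\d{2}[:,\d]+[ \n](-->)*')

sample = (
    "00:00:14,099 --> 00:00:19,100\n"
    "a classic math problem a\n"
    "00:00:17,039 --> 00:00:28,470\n"
    "will come from an unexpected place"
)

# Each timestamp line is removed in two matches: "00:00:14,099 -->" first,
# then "00:00:19,100\n". The space between them is not part of either match,
# so it is left behind at the start of the caption line.
cleaned = TIMESTAMP.sub('', sample)

# Strip the leftover leading space from each caption line.
caption_lines = [line.strip() for line in cleaned.split('\n')]
print(caption_lines)
```

Note that this removes every timestamp; it does not select a time range.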
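As some others suggested in the comments, you may want to look at parsers that do this for you by building a parse tree. They are more robust, and a Google search will turn up several.

There is no need to iterate line by line. Try the code below. It gives you the dictionary you want.
import re

# Pair each timestamp line with the line that follows it.
# The result is named captions to avoid shadowing the built-in dict.
captions = dict(re.findall(r'(\d{2}:\d{2}.*)\n(.*)', s))
print(captions.values())
Output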
dict_values(['a classic math problem a', 'will come from an unexpected place', '', '', "binary numbers first I'm going to give", 'puzzle and then you can try to solve it', 'like I said you have a thousand bottles'])
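The approaches above collect every caption, while the question asks only for those inside a time range. Here is a minimal sketch of one way to do that, under the assumption that the range is selected by each cue's start time and is inclusive at both ends. The helper names to_delta, parse_cues, and captions_between are hypothetical, not part of any library.

```python
import re
from datetime import timedelta

s = """00:00:14,099 --> 00:00:19,100
a classic math problem a
00:00:17,039 --> 00:00:28,470
will come from an unexpected place
00:00:18,039 --> 00:00:19,470
00:00:20,039 --> 00:00:21,470
00:00:22,100 --> 00:00:30,119
binary numbers first I'm going to give
00:00:30,119 --> 00:00:35,430
puzzle and then you can try to solve it
00:00:32,489 --> 00:00:37,170
like I said you have a thousand bottles"""

# One timestamp line: "HH:MM:SS,mmm --> HH:MM:SS,mmm".
CUE = re.compile(
    r'(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})'
)

def to_delta(h, m, sec, ms):
    """Convert the four timestamp fields to a timedelta."""
    return timedelta(hours=int(h), minutes=int(m),
                     seconds=int(sec), milliseconds=int(ms))

def parse_cues(text):
    """Return (start, end, caption) tuples in file order."""
    cues = []
    for line in text.split('\n'):
        line = line.strip()
        match = CUE.fullmatch(line)
        if match:
            fields = match.groups()
            cues.append([to_delta(*fields[:4]), to_delta(*fields[4:]), ''])
        elif cues and line:
            # Caption text belongs to the most recent timestamp line.
            cues[-1][2] = (cues[-1][2] + ' ' + line).strip()
    return [tuple(cue) for cue in cues]

def captions_between(text, first_start, last_start):
    """Captions of cues whose start time lies in [first_start, last_start]."""
    lo = to_delta(*re.split('[:,]', first_start))
    hi = to_delta(*re.split('[:,]', last_start))
    return [cap for start, _end, cap in parse_cues(text) if lo <= start <= hi]

result = captions_between(s, '00:00:17,039', '00:00:30,119')
print(result)
```

Comparing timedelta values rather than strings keeps the filter correct even if the timestamps are not zero-padded consistently. Cues with no caption text come back as empty strings, which matches the two empty entries in the expected output.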
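Can you post the expected output?
I would rather use a library to parse these SRT files. There are many libraries available.
@Rakesh I pasted the expected output.
Is the time range fixed or dynamic? Where does it come from?
In the expected output you have "puzzle and then...", but this text comes after 00:00:30,119 --> 00:00:35,430. Is that a typo?
Can you explain the regex?
Added an explanation.
[:,\d{3}]+ is a character class that matches one or more of :, a comma, \d, { or }. [ ^\n] matches a space, a caret, or a newline. Take a look at an introduction to regex.
They don't want to get all the text, only the text between 00:00:17,039 --> 00:00:28,470
and 00:00:30,119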
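The comment about character classes can be checked directly. The short sketch below shows that a quantifier such as {3} has no special meaning inside square brackets, and that a caret is only a negation when it is the first character of the class. The variable names are illustrative only.

```python
import re

# Inside [...], "{", "3" and "}" are literal characters, not a quantifier.
loose = r'[:,\d{3}]+'
matches_timestamp = re.fullmatch(loose, '00:00:14,099') is not None
matches_braces = re.fullmatch(loose, '{}') is not None

# Spelling out the timestamp shape rejects the stray braces.
strict = r'\d{2}:\d{2}:\d{2},\d{3}'
strict_matches_timestamp = re.fullmatch(strict, '00:00:14,099') is not None
strict_rejects_braces = re.fullmatch(strict, '{}') is None

# "[ ^\n]" is a space, a literal caret, or a newline.
literal_accepts_caret = re.fullmatch(r'[ ^\n]', '^') is not None
literal_accepts_newline = re.fullmatch(r'[ ^\n]', '\n') is not None

# "[^\n]" (caret first) is a negated class: anything except a newline.
negated_rejects_newline = re.fullmatch(r'[^\n]', '\n') is None

print(matches_timestamp, matches_braces)
print(strict_matches_timestamp, strict_rejects_braces)
print(literal_accepts_caret, literal_accepts_newline, negated_rejects_newline)
```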