Python: how to parse a movie script into a dictionary
I have the following data:
script = """
JOSH:
How do I know if this works?
MICHAEL:
You would know
JOSH:
But how?
DAN:
How indeed? I don't really know.
UNKNOWN:
I am unknown
"""
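I want to find the text spoken by each character in [Josh, Michael, Dan] and ignore UNKNOWN. Note that in this toy example each character has only one line per turn, but there are more in the real data.
I would like to end up with a dictionary of this form:
lines = {}
lines["Josh"] = ["How do I know if this works?", "But how?"]
lines["Michael"] = ["You would know"]
lines["Dan"] = ["How indeed?", "I don't really know."]
Or would another data structure be better?
You can split the script into "blocks" on double newlines. Each block starts with a line containing the speaker, and the rest of the block is the text. Try this: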
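from collections import defaultdict

script = """\
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how?

DAN:
How indeed? I don't really know.

UNKNOWN:
I am unknown
"""

# Turns are separated by a blank line, so split on double newlines
line_blocks = script.split("\n\n")
# Map the header as it appears in the script ("JOSH:") to the wanted name ("Josh")
wanted_names = {name.upper() + ":": name for name in ["Josh", "Michael", "Dan"]}
result = defaultdict(list)
for block in line_blocks:
    # The first line is the speaker, the rest is the spoken text
    name, text = block.split("\n", 1)
    if name in wanted_names:
        result[wanted_names[name]].append(text)

print(result["Josh"])
print(result["Michael"])
print(result["Dan"])
Output:
['How do I know if this works?', 'But how?']
['You would know']
["How indeed? I don't really know."]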
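The block-splitting idea can also cover turns that span several lines. The following is only a minimal sketch under stated assumptions: turns are separated by a blank line, each turn starts with an upper-case name followed by a colon, and the helper name parse_blocks and the sample text are made up for illustration.

```python
from collections import defaultdict


def parse_blocks(script, wanted):
    """Group spoken lines by speaker, keeping only speakers listed in `wanted`."""
    # Map the header as written in the script ("JOSH:") to the display name ("Josh").
    headers = {name.upper() + ":": name for name in wanted}
    result = defaultdict(list)
    for block in script.strip().split("\n\n"):
        # First line is the speaker header, the remainder is the spoken text.
        header, _, text = block.partition("\n")
        name = headers.get(header.strip())
        if name is None:
            continue  # speaker not in the wanted list: skip the whole block
        # One list item per non-empty line, with surrounding whitespace removed.
        result[name].extend(
            line.strip() for line in text.splitlines() if line.strip()
        )
    return dict(result)


# Hypothetical sample: one two-line turn, one unwanted speaker, one trailing space.
sample = "JOSH:\nHow do I know?\nSecond line\n\nUNKNOWN:\nIgnored\n\nDAN:\nHow indeed? \n"
parsed = parse_blocks(sample, ["Josh", "Michael", "Dan"])
print(parsed)
```

Speakers that never appear in the script (Michael in this sample) are simply absent from the returned dict, so callers may prefer parsed.get(name, []) over direct indexing.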
I'm not entirely sure about your final structure, but if it is very consistent you can use a regular expression. Here is my code:
import re
script = """
JOSH:
How do I know if this works?
MICHAEL:
You would know
JOSH:
But how?
DAN:
How indeed? I don't really know.
UNKNOWN:
I am unknown
"""
# This regex is extracting two groups.
# The first one is one or more words before the ":" (the character's name)
# The second one will be everything between newlines (the line)
matcher = re.compile(r"(\w+):\n(.*)\n")
groups_extracted = matcher.findall(script)
result = {}
for element in groups_extracted:
    # A little verbosity to make understanding easier
    author = element[0]
    line = element[1]
    if author in result:
        # In case the author name is already in the result dict
        # we just append a new line to his / her name
        result[author].append(line)
    else:
        # Otherwise the author name needs to be added to the dict
        # from scratch with his / her 1st line
        result[author] = [line]
print(result)
print(result['JOSH'])
Output:
{'JOSH': ['How do I know if this works?', 'But how?'], 'MICHAEL': ['You would know'], 'DAN': ["How indeed? I don't really know."], 'UNKNOWN': ['I am unknown']}
['How do I know if this works?', 'But how?']
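The pattern above captures exactly one line after each name. If a turn can run over several lines, one possible alternative is to walk the script line by line and remember the current speaker. This is a sketch only, and it assumes that a speaker header is a line made up solely of capital letters followed by a colon; the names HEADER, group_by_speaker and the ignore parameter are hypothetical.

```python
import re
from collections import defaultdict

# Assumed header format: capital letters, a colon, nothing else on the line.
HEADER = re.compile(r"^([A-Z]+):\s*$")


def group_by_speaker(script, ignore=("UNKNOWN",)):
    """Collect every spoken line under the most recent speaker header."""
    result = defaultdict(list)
    current = None
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        match = HEADER.match(line)
        if match:
            current = match.group(1)  # a new speaker starts talking
        elif current is not None and current not in ignore:
            result[current].append(line)
    return dict(result)


demo = """
JOSH:
How do I know if this works?
And another line
MICHAEL:
You would know

UNKNOWN:
I am unknown
JOSH:
But how?
"""
grouped = group_by_speaker(demo)
print(grouped)
```

This version does not depend on blank lines between turns, but a spoken line that happens to consist of a single capitalised word and a colon would be mistaken for a header.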
I added a few lines for each name to get closer to the real task, and used regular expressions to do this safely:
import re
import pprint

script = """
JOSH:
How do I know if this works?
And here is another line for JOSH

MICHAEL:
You would know
And another line for MICHAEL

JOSH:
But how?
One more for JOSH

DAN:
How indeed? I don't really know.
One more for DAN

UNKNOWN:
I am unknown
"""

# split by paragraph, by at least 2 consecutive newlines
# (no flags here: the third positional argument of re.split is maxsplit, not flags)
pars = re.split(r'\n\n+', script)
d = {}
for p in pars:  # for each paragraph
    # capture the name (anchored to beginning of line and all capitals)
    # and the rest of the paragraph - (.*)
    name, txt = re.search(r'^([A-Z]+):(.*)', p, re.S | re.M).group(1, 2)
    # Each line of the paragraph as a list item
    if name in d:
        d[name] += txt.strip().split('\n')
    else:
        d[name] = txt.strip().split('\n')
pprint.pprint(d)
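Output:
{'DAN': ["How indeed? I don't really know.", 'One more for DAN'],
 'JOSH': ['How do I know if this works?',
          'And here is another line for JOSH',
          'But how?',
          'One more for JOSH'],
 'MICHAEL': ['You would know', 'And another line for MICHAEL'],
 'UNKNOWN': ['I am unknown']}
Please provide a ...
dict may not be the container you want to use. If you use each character as a key, then at the end of the day you are left with only one line per character: perhaps lines[Dan] = ["How indeed?", "I don't really know."] would be more suitable.
@Chris I think a dict is appropriate in this case (if the goal is to extract all lines by character). Just store a list of lines in the dict rather than a single line.
@Chris Just check whether the dict key exists. If it does, append the line.
It depends on what "each character has only one line per turn, but there are more in the real data" means. Keep in mind that this solution assumes there are no newlines within each instance of a character speaking.
@Aaron Correct. Probably split into "blocks" on double newlines, then map each block to a name by its first line. Will edit.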
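On the question of whether another data structure would be better: a dict keyed by speaker loses the order in which the turns occurred. One option, shown here purely as an illustrative sketch, is to keep an ordered list of (speaker, text) tuples and derive the per-speaker dict from it when needed. The turns list below is hand-written sample data, and lines_by_speaker is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written sample data: turns in the order they appear in the script.
turns = [
    ("JOSH", "How do I know if this works?"),
    ("MICHAEL", "You would know"),
    ("JOSH", "But how?"),
    ("DAN", "How indeed? I don't really know."),
    ("UNKNOWN", "I am unknown"),
]


def lines_by_speaker(turns, ignore=("UNKNOWN",)):
    """Build the speaker -> list of lines mapping from an ordered list of turns."""
    grouped = defaultdict(list)
    for speaker, text in turns:
        if speaker not in ignore:
            grouped[speaker].append(text)
    return dict(grouped)


by_speaker = lines_by_speaker(turns)
print(by_speaker["JOSH"])
```

The list keeps the conversational order for later use, while the derived dict answers the "everything Josh said" query directly.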