How to parse a movie script into a dictionary in Python


I have data like the following:

script = """
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how? 

DAN:
How indeed? I don't really know. 


UNKNOWN: 
I am unknown
"""

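I want to find the text spoken by each character in [Josh, Michael, Dan] and ignore UNKNOWN. Note that in this toy example each character has only one line per turn, but the real data is more realistic.

I would like to end up with a dictionary of this form:

lines = {}
lines["Josh"] = ["How do I know if this works?", "But how?"]
lines["Michael"] = "You would know"
lines["Dan"] = ["How indeed?", "I don't really know."]

Or perhaps another data structure would be better.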
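As one possible answer to the "another data structure" question, here is a minimal sketch that keeps the dialogue as an ordered list of (speaker, text) turns and derives the per-character dictionary from it. The `turns` list is hand-written illustration data and `group_by_speaker` is a hypothetical helper name; neither comes from the original post.

```python
from collections import defaultdict

# Hypothetical, hand-written turns (not parsed output); order of speech is preserved.
turns = [
    ("Josh", "How do I know if this works?"),
    ("Michael", "You would know"),
    ("Josh", "But how?"),
    ("Dan", "How indeed? I don't really know."),
]


def group_by_speaker(turns):
    """Collect every turn under its speaker, always as a list."""
    lines = defaultdict(list)
    for speaker, text in turns:
        lines[speaker].append(text)
    return dict(lines)


lines = group_by_speaker(turns)
print(lines["Josh"])  # ['How do I know if this works?', 'But how?']
```

Storing a list for every character, even one with a single line, avoids having to check whether a value is a string or a list later on.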

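You can split the script into "blocks" on double newlines.

Each block starts with a line containing the speaker, and the rest is the text.

Try this: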

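from collections import defaultdict

script = """\
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how? 

DAN:
How indeed? I don't really know. 

UNKNOWN:
I am unknown
"""

# One block per turn: blocks are separated by a blank line
line_blocks = script.split("\n\n")
# Map the header as it appears in the script ("JOSH:") to the wanted name ("Josh")
wanted_names = {name.upper() + ":": name for name in ["Josh", "Michael", "Dan"]}
result = defaultdict(list)

for block in line_blocks:
    # First line is the speaker, the rest of the block is the text
    name, text = block.split("\n", 1)
    if name in wanted_names:
        result[wanted_names[name]].append(text)

print(result["Josh"])
print(result["Michael"])
print(result["Dan"])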

Output:

['How do I know if this works?', 'But how? ']
['You would know']
["How indeed? I don't really know. "]
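The block-splitting idea above relies on exactly one blank line between blocks and on a header with nothing after the colon, while the data in the question has an extra blank line and a trailing space after "UNKNOWN:". A minimal sketch of a more tolerant variant follows; `parse_blocks` is a hypothetical helper name, and stripping whitespace from the spoken text is an assumption about what is wanted.

```python
import re
from collections import defaultdict


def parse_blocks(script, names):
    """Group speech by speaker, keeping only the wanted names."""
    wanted = {name.upper(): name for name in names}
    result = defaultdict(list)
    # Split on a newline, optional whitespace, and another newline,
    # so two or more blank lines still count as a single separator.
    for block in re.split(r"\n\s*\n", script.strip()):
        header, _, text = block.partition("\n")
        # Tolerate trailing whitespace after the colon, e.g. "UNKNOWN: "
        speaker = header.strip().rstrip(":").strip()
        if speaker in wanted:
            result[wanted[speaker]].append(text.strip())
    return dict(result)


# Escapes are used so the stray spaces and extra blank line are visible.
script = (
    "JOSH:\nHow do I know if this works?\n\n"
    "MICHAEL: \nYou would know\n\n\n"
    "JOSH:\nBut how? \n\n"
    "UNKNOWN: \nI am unknown\n"
)
parsed = parse_blocks(script, ["Josh", "Michael", "Dan"])
print(parsed)
# {'Josh': ['How do I know if this works?', 'But how?'], 'Michael': ['You would know']}
```

Characters that never speak, such as Dan in this shortened sample, simply do not appear as keys in the returned dictionary.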

I am not quite sure about your final structure, but if the script is very consistent, you can use a regular expression.

Here is my code:

import re

script = """
JOSH:
How do I know if this works?

MICHAEL:
You would know

JOSH:
But how? 

DAN:
How indeed? I don't really know. 

UNKNOWN:
I am unknown

"""
# This regex is extracting two groups.
# The first one is one or more words before the ":" (the character's name)
# The second one will be everything between newlines (the line)
matcher = re.compile(r"(\w+):\n(.*)\n")
groups_extracted = matcher.findall(script)

result = {}

for element in groups_extracted:
    # A little verbosity to make understanding easier
    author = element[0]
    line = element[1]
    if author in result:
        # In case the author name is already in the result dict
        # we just append a new line on his / her name
        result[author].append(line)
    else:
        # Otherwise the author name needs to be added to the dict
        # from scratch with his / her 1st line
        result[author] = [line]

print(result)

print(result['JOSH'])

{'JOSH': ['How do I know if this works?', 'But how? '], 'MICHAEL': ['You would know'], 'DAN': ["How indeed? I don't really know. "], 'UNKNOWN': ['I am unknown']}

['How do I know if this works?', 'But how? ']
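The regular expression above captures only the single line that follows each name. A minimal sketch of one way to capture every line of a turn is shown below; it assumes turns are separated by a blank line, and the `TURN` pattern and `parse_multiline` name are illustrative rather than part of the answer.

```python
import re
from collections import defaultdict

# A name line ("JOSH:"), then every following non-empty line.
# Assumption: a blank line (or the end of the text) ends the turn.
TURN = re.compile(r"^(\w+):[ \t]*\n((?:.+\n?)+)", re.MULTILINE)


def parse_multiline(script):
    """Return {speaker: [line, line, ...]} across all of the speaker's turns."""
    result = defaultdict(list)
    for speaker, speech in TURN.findall(script):
        result[speaker].extend(line.strip() for line in speech.splitlines())
    return dict(result)


sample = (
    "JOSH:\nHow do I know if this works?\nAnd here is another line for JOSH\n\n"
    "MICHAEL:\nYou would know\n\n"
    "JOSH:\nBut how?\n"
)
multi = parse_multiline(sample)
print(multi["JOSH"])
```

If two turns follow each other without a blank line in between, this pattern treats the second name line as speech, so the input would need to be normalized first.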

I added a few lines for each name to get closer to the real task, and used regular expressions to do this safely:

import re
import pprint

script = """
JOSH:
How do I know if this works?
And here is another line for JOSH

MICHAEL:
You would know
And another line for MICHAEL

JOSH:
But how? 
One more for JOSH

DAN:
How indeed? I don't really know. 
One more for DAN


UNKNOWN: 
I am unknown
"""

# split by paragraph, on 2 or more consecutive newlines
# (no flags here: the third positional argument of re.split is maxsplit, not flags)
pars = re.split(r'\n\n+', script)
d = {}

for p in pars:  # for each paragraph
    # capture the name (anchored to beginning of line and all capitals)
    # and the rest of the paragraph - (.*)
    name, txt = re.search(r'^([A-Z]+):(.*)', p, re.S + re.M).group(1, 2)

    # Each sentence as a list item
    if name in d:
        d[name] += txt.strip().split('\n')
    else:
        d[name] = txt.strip().split('\n')



pprint.pprint(d)

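Output:

{'DAN': ["How indeed? I don't really know. ", 'One more for DAN'],
 'JOSH': ['How do I know if this works?',
      'And here is another line for JOSH',
      'But how? ',
      'One more for JOSH'],
 'MICHAEL': ['You would know', 'And another line for MICHAEL'],
 'UNKNOWN': ['I am unknown']}

A dict may not be the container you want to use. If you use each character as a key, at the end of the day you are left with only one line per character: maybe lines[Dan] = ["How indeed?", "I don't really know."] would be more suitable.

@Chris I think a dict is suitable in this case (if the goal is to extract all lines by character). Just store a list of lines in the dict rather than a single line.

@Chris Just check whether the dict key exists. If it does, append the line.

It depends on what "each character has only one line per turn, but the real data is more realistic" means. Keep in mind that this solution assumes there are no newlines within each instance of a character speaking.

@Aaron Correct. It could probably be split into "blocks" on double newlines, and then each block mapped to a name by its first line. Will edit.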
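The paragraph-based code above calls .group() directly on the result of re.search, which raises AttributeError for any paragraph that has no "NAME:" header, and it keeps UNKNOWN in the result. The following is a minimal sketch of a variant that skips both cases and uses dict.setdefault for the "check whether the key exists, then append" step suggested in the comments. The `parse_paragraphs` name, the `ignore` parameter, and the stage-direction paragraph in the sample are assumptions made for illustration.

```python
import re


def parse_paragraphs(script, ignore=("UNKNOWN",)):
    """Return {NAME: [lines]} and skip ignored names and header-less paragraphs."""
    d = {}
    for p in re.split(r"\n\n+", script.strip()):
        # Anchor at the start of the paragraph; re.S lets (.*) span several lines
        m = re.match(r"([A-Z]+):(.*)", p, re.S)
        if m is None or m.group(1) in ignore:
            continue
        name, txt = m.group(1, 2)
        # setdefault creates the list on first use, then extend appends the lines
        d.setdefault(name, []).extend(txt.strip().split("\n"))
    return d


sample = (
    "JOSH:\nHow do I know if this works?\nAnd here is another line for JOSH\n\n"
    "A stage direction without a speaker\n\n"
    "UNKNOWN: \nI am unknown\n\n"
    "JOSH:\nBut how?\n"
)
d = parse_paragraphs(sample)
print(d)
```

Passing a different tuple as `ignore`, or replacing it with an allow-list of wanted names, would be an equally reasonable design depending on how many characters the real script contains.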