Python: adding character names and their lines from an array/list to a new dictionary

Tags: python, regex, text, nltk, analysis

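I have a movie script. My first task is to collect each character's lines in a dictionary.

Later I need to put the data into a Series.

For now, I have all of the dialogue in a list, where each item starts with the character name. It is in this format: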

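Dialogue[0] 'NAME1\n(16 spaces)Yo, Yo, good that you're here man.'

Every name ends with \n, and every line of dialogue then starts with 16 spaces. I think that could be useful, but I don't know how to take advantage of it.

I have tried many approaches, with almost no success.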

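result = {}
for para in dialogue:
    first_token = para.split()[0]
    if first_token.endswith('\n'):  # this would be the name
        name, line = para.split(on new line?)
        name = name.strip()
        if name not in result:
            result[name] = []
        result[name].append(line)
return result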

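This code gives me a whole pile of errors, so I don't think it is useful to list them here.

Ideally, I need each character as a key in the dictionary, with all of that character's lines as the value.

Something like this: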

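NAME1: [line 1, line 2, line 3...] NAME2: [line 1, line 2, line 3...]

Edit: some character names consist of two words.

Edit 2: maybe it would be easier to go back to the original movie script text file.

It is in this format: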

          NAME1
Yo, Yo, good that you're here
man.

          NAME2
     (Laughing)
I don't think that's good!  We were
at the club, smoking, laughing -- doing
stuff.

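Edited answer: going back to the original file, if we can assume that every character name is preceded by 22 whitespace characters, we can do this: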

example = """
                      NAME1
            Yo, Yo, good that you're here
            man.

                      NAME2
                 (Laughing)
            I don't think that's good!  We were
            at the club, smoking, laughing -- doing
            stuff.
"""

lines = example.split('\n')
characters = [line for line in lines if line.startswith(' ' * 22)]
result = {c.strip(): [] for c in characters}
current = ''
for line in lines:
    if line in characters:
        current = line.strip()
    elif current:
        result[current].append(line.strip())

The result is now:

{'NAME1': ["Yo, Yo, good that you're here", 'man.', ''], 'NAME2': ['(Laughing)', "I don't think that's good!  We were", 'at the club, smoking, laughing -- doing', 'stuff.', '']}

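This may need some additional cleanup: each list still contains empty strings from blank lines, and stage directions such as (Laughing) are stored as if they were dialogue.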
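One possible cleanup pass is sketched below. It is not part of the original answer. It assumes, as the answer does, that name lines start with 22 spaces, and it additionally assumes that any line fully wrapped in parentheses is a stage direction and that wrapped lines should be joined into one string per speech. The function name collect_speeches is made up for illustration, and the example script is rebuilt from explicit space counts so the indentation is unambiguous.

```python
import re

NAME_INDENT = 22  # assumption carried over from the answer above

# Same sample script, built from explicit indentation widths.
example = "\n".join([
    " " * 22 + "NAME1",
    " " * 12 + "Yo, Yo, good that you're here",
    " " * 12 + "man.",
    "",
    " " * 22 + "NAME2",
    " " * 17 + "(Laughing)",
    " " * 12 + "I don't think that's good!  We were",
    " " * 12 + "at the club, smoking, laughing -- doing",
    " " * 12 + "stuff.",
])


def collect_speeches(script):
    """Map each character name to a list of speeches (one string per speech)."""
    collected = {}
    current = None
    for raw in script.split("\n"):
        text = raw.strip()
        if not text:
            continue  # drop blank lines instead of storing ''
        if raw.startswith(" " * NAME_INDENT):
            current = text
            collected.setdefault(current, []).append([])  # start a new speech
        elif current is not None:
            if text.startswith("(") and text.endswith(")"):
                continue  # assumed to be a stage direction, not dialogue
            collected[current][-1].append(text)
    # Join wrapped lines and collapse runs of whitespace.
    return {
        name: [re.sub(r"\s+", " ", " ".join(parts)) for parts in speeches if parts]
        for name, speeches in collected.items()
    }


speeches = collect_speeches(example)
print(speeches)
```

Keeping one string per speech rather than one per physical line tends to be more convenient if the data is later loaded into a Series, but that is a design choice, not a requirement.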

Method 1:

Split on '\n' and strip each piece. The first element of the list is the name and the rest are your lines. list.pop modifies the list in place. This solution will not work if your dialogue has multi-line text.

>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> lines = list(map(str.strip, dialogue.split('\n')))
>>> lines
['NAME1', 'abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
>>> name = lines.pop(0)
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']

Method 2:

If you have multi-line dialogue, i.e. the dialogue itself may contain '\n' characters, first split on the first occurrence of '\n'. The first element is the name; the remainder is then split further on runs of 16 spaces.

>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> parse_temp = dialogue.split('\n',1)
>>> name = parse_temp[0]
>>> lines = parse_temp[1].split(" " * 16)[1:]
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw\n', 'sadi waiudi qoweoq asodhoqndoqndqwdq.\n', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']

As a function:

def parse(dialogue):
    parse_temp = dialogue.split('\n',1)
    name = parse_temp[0].strip()
    lines = list(map(str.strip, parse_temp[1].split(" " * 16)[1:]))
    return name, lines

Note: for the second split you can substitute any whitespace pattern, or even split with a regular expression. I have simply used 16 spaces here.
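As a rough illustration of the regex variant mentioned in the note, the sketch below splits the body on a newline followed by any run of spaces or tabs, so it does not depend on the indent being exactly 16 spaces. The name parse_with_regex and the sample string are invented for this example. It also uses str.partition, which returns an empty body instead of raising when an entry has no newline.

```python
import re


def parse_with_regex(dialogue):
    """Split one dialogue entry into (name, lines), tolerating any indent width."""
    # partition never raises: with no '\n' in the entry, body is simply ''.
    name, _, body = dialogue.partition("\n")
    # Split on a newline plus whatever indentation follows it.
    parts = re.split(r"\n[ \t]*", body)
    lines = [part.strip() for part in parts if part.strip()]
    return name.strip(), lines


# Hypothetical entry: two lines indented by 16 spaces, one by 12.
sample = (
    "NAME1 \n"
    + " " * 16 + "abc def\n"
    + " " * 16 + "ghi jkl.\n"
    + " " * 12 + "mno!!"
)
name, lines = parse_with_regex(sample)
print(name, lines)
```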

Code added on request, to iterate over the whole dialogue list:

data = dict()
for _dialogue in dialogue:
   name, lines = parse(_dialogue)
   data[name] = data.get(name, list()) + lines
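The same accumulation can also be written with collections.defaultdict, which avoids rebuilding the list on every iteration. The sketch below is only one way to do it: parse_entry and group_by_character are hypothetical helper names, and the sample entries are invented. It strips names so that 'NAME1 ' and 'NAME1' land under the same key, and it raises a clear error if an item is not a string (for example, if a whole list is passed where a single entry is expected).

```python
from collections import defaultdict


def parse_entry(entry):
    """Return (name, lines) for one dialogue entry of the form 'NAME\\n    text...'."""
    name, _, body = entry.partition("\n")
    lines = [line.strip() for line in body.split("\n") if line.strip()]
    return name.strip(), lines


def group_by_character(dialogues):
    """Collect the lines of every entry under its (stripped) character name."""
    data = defaultdict(list)
    for entry in dialogues:
        if not isinstance(entry, str):
            raise TypeError(f"expected a string entry, got {type(entry).__name__}")
        name, lines = parse_entry(entry)
        if name:
            data[name].extend(lines)
    return dict(data)


# Invented sample data in the format described in the question.
dialogues = [
    "NAME1\n" + " " * 16 + "Yo, Yo, good that you're here\n" + " " * 16 + "man.",
    "NAME2\n" + " " * 16 + "I don't think that's good!",
    "NAME1 \n" + " " * 16 + "Right.",
]
grouped = group_by_character(dialogues)
print(grouped)
```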

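Split the text lines.
Create a dict with a unique key for each actor.
Append each actor's lines to the dict.

Edit: added a space to the name regex and stripped whitespace from the names.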

import re
lines = [
    "Dialogue[0] 'NAME1 \n                YO, YO, good that you're here man.'",
    "Dialogue[1] 'NAME 1\n                YO, YO, ",
    "Dialogue[2] 'NAME2\n                YO, YO, good that ",
    "Dialogue[3] 'NAME2\n                YO, YO, good that you're here'",
]

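# Name: capitals, digits and spaces after the opening quote, then a newline,
# exactly 16 spaces, and the rest of that line as the dialogue text.
regex = re.compile("'([A-Z 0-9]+)\n[ ]{16}(.+)")
matches = [regex.findall(line) for line in lines]
pairs = [match[0] for match in matches if match]  # first (name, text) pair per entry

result = {}
for name, text in pairs:
    # strip the name so 'NAME1 ' and 'NAME1' share one key
    result.setdefault(name.strip(), []).append(text)
result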

Output:

{'NAME 1': ['YO, YO, '],
 'NAME1': ["YO, YO, good that you're here man.'"],
 'NAME2': ['YO, YO, good that ', "YO, YO, good that you're here'"]}
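Note that (.+) stops at the first newline, so only the first physical line of each speech is captured. If a speech wraps over several lines, one option is to compile the pattern with re.DOTALL and split the captured text afterwards. The sketch below assumes each entry ends with a closing single quote, as in the sample data above; the entry itself is invented for illustration.

```python
import re

# Hypothetical entry in the same "Dialogue[i] '...'" form, with a wrapped speech.
entry = (
    "Dialogue[0] 'NAME 1 \n"
    + " " * 16 + "YO, YO, good that\n"
    + " " * 16 + "you're here man.'"
)

# With re.DOTALL, '.' also matches newlines, so group 2 spans the whole speech
# up to the final closing quote.
pattern = re.compile(r"'([A-Z 0-9]+)\n {16}(.+)'", re.DOTALL)

match = pattern.search(entry)
if match:
    name = match.group(1).strip()
    lines = [part.strip() for part in match.group(2).split("\n")]
    print(name, lines)
```

Because the final group is greedy, the apostrophe inside "you're" is not mistaken for the closing quote.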

Use this regular expression to split the data:

'([A-Z 0-9]+)\n[ ]{16}(.+)

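Before going back to the original movie script, please check Method 2 in my answer, since you seem to have multi-line dialogue. Since you already know how to split between the characters' dialogues, this might work.

In the original text file, are names always preceded by several spaces? And does the dialogue start at the beginning of a line?

Thanks for the reply. I got this error: ValueError: too many values to unpack (expected 2)

Can you try this:

name, line = item.split(maxsplit=1)

That gives the same error message: ValueError: not enough values to unpack (expected 2, got 1)

That is not the same error... but anyway, I guess your data is not consistently formatted and needs to be cleaned up first @fishmanI

@fishmanI I have made an edit to show the original script. Can you explain how I could get the same dictionary result from it? Maybe that would be easier.

It gives me an index error: IndexError: list index out of range

This has worked, but there is a small bug: a character name followed by a trailing space shows up as a different name.

It worked, thank you! But I just noticed another problem: there are 4 copies of each set of lines. There are duplicates.

@fishman What do you mean by a set of lines? Can you give an example?

For example, it finds all of the lines, which is good. But each one appears four times:

When I run the function, it gives me an error: AttributeError: 'list' object has no attribute 'split'

Can you paste a sample of the dialogue you pass to the function as input on pastebin and put a link here?

Ah, you misunderstood. I meant for you to use it on the data from the original question, i.e. Dialogue[0], which I assume is one character's dialogue. I thought you already had that logic, don't you? So I expect you to use this on every dialogue in the dialogue list, as you mentioned in the original question.

How would I run this function on every dialogue item?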