Python: adding character names and their lines from an array/list to a new dictionary

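I have a movie script. My first job is to collect each character's lines in a dictionary.

Later I need to put the data into a Series.

Right now I have all of the dialogue in a list, with each item starting with the character name. It is formatted like this: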

Dialogue[0] 'NAME1\n(16 spaces)Yo, Yo, good that you're here man.'

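Every name ends with \n, and every line of dialogue then starts with 16 spaces. I think this could be useful, but I don't know how to take advantage of it.

I have tried many approaches, with almost no success.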

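result = {}
for para in dialogue:
    first_token = para.split()[0]
    if first_token.endswith('\n'):  # this would be the name
        name, line = para.split(at new line?)
        name = name.strip()
        if name not in result:
            result[name] = []
        result[name].append(line)
return result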
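This code gave me a whole bunch of errors, so I don't think it is useful to list them here.

Ideally, I need each character as a key in the dictionary, with all of their lines as the value.

Something like this: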

NAME1: [line 1, line 2, line 3, ...]
NAME2: [line 1, line 2, line 3, ...]

Edit: Some character names have two words.

Edit 2: It may be easier to go back to the original movie script text file.

It is formatted like this:

          NAME1
Yo, Yo, good that you're here
man.

          NAME2
     (Laughing)
I don't think that's good!  We were
at the club, smoking, laughing -- doing
stuff.

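Edited answer: going back to the original file, if we can assume that every character name is preceded by 22 whitespace characters, we can do this: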

example = """
                      NAME1
            Yo, Yo, good that you're here
            man.

                      NAME2
                 (Laughing)
            I don't think that's good!  We were
            at the club, smoking, laughing -- doing
            stuff.
"""

lines = example.split('\n')
characters = [line for line in lines if line.startswith(' ' * 22)]
result = {c.strip(): [] for c in characters}
current = ''
for line in lines:
    if line in characters:
        current = line.strip()
    elif current:
        result[current].append(line.strip())
The result is now:

{'NAME1': ["Yo, Yo, good that you're here", 'man.', ''], 'NAME2': ['(Laughing)', "I don't think that's good!  We were", 'at the club, smoking, laughing -- doing', 'stuff.', '']}

This may need some additional cleanup.
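One possible cleanup pass is sketched below. It assumes that empty strings and lines consisting only of a parenthesised stage direction, such as '(Laughing)', are unwanted; the question does not say so, so adjust as needed. The helper name clean_lines is hypothetical and is not part of the answer above.

```python
import re

# Assumption: a line that is entirely wrapped in parentheses is a stage
# direction rather than spoken text.
PARENTHETICAL = re.compile(r"^\(.*\)$")

def clean_lines(result):
    """Return a copy of result without empty strings or parentheticals."""
    cleaned = {}
    for name, lines in result.items():
        cleaned[name] = [
            line for line in lines
            if line and not PARENTHETICAL.match(line)
        ]
    return cleaned

# The dictionary produced by the code above, copied here so the sketch
# runs on its own.
result = {
    'NAME1': ["Yo, Yo, good that you're here", 'man.', ''],
    'NAME2': ['(Laughing)', "I don't think that's good!  We were",
              'at the club, smoking, laughing -- doing', 'stuff.', ''],
}
cleaned = clean_lines(result)
print(cleaned)
```

The original dictionary is left untouched, so the raw and cleaned versions can be compared.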

Method 1:

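Split on '\n' and strip each piece. The first element of the list is the name, and the rest are your lines. list.pop modifies the list in place. Note that every '\n' starts a new list element, so this approach does not keep a speech that spans several lines together as a single item.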

>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> lines = list(map(str.strip, dialogue.split('\n')))
>>> lines
['NAME1', 'abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
>>> name = lines.pop(0)
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
Method 2:

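If you have multi-line dialogue, i.e. the dialogue may contain '\n' characters, first split on the first occurrence of '\n' only. The first element is the name; the remainder is then split further on runs of 16 spaces.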

>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> parse_temp = dialogue.split('\n',1)
>>> name = parse_temp[0]
>>> lines = parse_temp[1].split(" " * 16)[1:]
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw\n', 'sadi waiudi qoweoq asodhoqndoqndqwdq.\n', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
As a function:

def parse(dialogue):
    parse_temp = dialogue.split('\n',1)
    name = parse_temp[0].strip()
    lines = list(map(str.strip, parse_temp[1].split(" " * 16)[1:]))
    return name, lines
Note: for the second split you can substitute any whitespace pattern. You could even split with a regular expression. I simply used 16 spaces here.
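As an illustration of the regular-expression variant mentioned in the note, here is a minimal sketch. The function name parse_with_regex is hypothetical, and it assumes that each spoken line begins after a newline followed by any amount of spaces or tabs, rather than exactly 16 spaces.

```python
import re

def parse_with_regex(dialogue):
    # Everything before the first newline is the name; partition() never
    # raises, even when the item contains no newline at all.
    name, _, body = dialogue.partition('\n')
    # Split the remainder on a newline plus any run of spaces or tabs.
    parts = [part.strip() for part in re.split(r'\n[ \t]*', body)]
    # Drop empty fragments left over from blank lines.
    return name.strip(), [part for part in parts if part]

dialogue = ('NAME1\n                abc adbaiuho saidainbw\n'
            '                sadi waiudi qoweoq asodhoqndoqndqwdq.\n'
            '   qiudwqd!!')
name, lines = parse_with_regex(dialogue)
print(name, lines)
```

Because the indentation width is not hard-coded, an item whose lines are indented inconsistently (the last line of the sample has only three spaces) is still split correctly.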

Code for the iteration, added on request:

data = dict()
for _dialogue in dialogue:
   name, lines = parse(_dialogue)
   data[name] = data.get(name, list()) + lines
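To show the loop running end to end, here is a self-contained sketch. The sample list is made up to match the format described in the question, parse() is repeated from the answer so the block runs on its own, and dict.setdefault is swapped in as an equivalent alternative to data.get(name, list()) + lines.

```python
def parse(dialogue):
    # Same logic as parse() in the answer: name first, then the lines.
    parse_temp = dialogue.split('\n', 1)
    name = parse_temp[0].strip()
    lines = list(map(str.strip, parse_temp[1].split(" " * 16)[1:]))
    return name, lines

# Invented sample data: one string per speech, name first, and each
# spoken line indented by 16 spaces.
indent = ' ' * 16
dialogue = [
    'NAME1\n' + indent + "Yo, Yo, good that you're here\n" + indent + 'man.',
    'NAME2\n' + indent + "I don't think that's good!",
    'NAME1 \n' + indent + 'Second speech.',  # trailing space after the name
]

data = {}
for item in dialogue:
    name, lines = parse(item)
    # Create the list on first sight of a name, then extend it in place.
    data.setdefault(name, []).extend(lines)

print(data)
```

Since parse() strips the name, 'NAME1 ' and 'NAME1' end up under the same key.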
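  • Split the text lines
  • Create a dict with a unique key for each actor
  • Add each actor's lines to the dict
Edit: added a space to the name regex and stripped whitespace from the names.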

import re
lines = [
    "Dialogue[0] 'NAME1 \n                YO, YO, good that you're here man.'",
    "Dialogue[1] 'NAME 1\n                YO, YO, ",
    "Dialogue[2] 'NAME2\n                YO, YO, good that ",
    "Dialogue[3] 'NAME2\n                YO, YO, good that you're here'",
]

# A quote, the name (capitals, digits, spaces), a newline, 16 spaces, then the text
regex = re.compile("'([A-Z 0-9]+)\n[ ]{16}(.+)")
lineslist = [regex.findall(line) for line in lines]
# Keep the first (name, text) match of every line that matched at all
lineslist = [match[0] for match in lineslist if match]
result = {}
for name, text in lineslist:
    result.setdefault(name.strip(), []).append(text)
result
Output:

{'NAME 1': ['YO, YO, '],
 'NAME1': ["YO, YO, good that you're here man.'"],
 'NAME2': ['YO, YO, good that ', "YO, YO, good that you're here'"]}


Split the data using this regular expression:
'([A-Z0-9]+)\n[ ]{16}(.+)
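Before going back to the original movie script, check Method 2 in my answer, since you seem to have multi-line dialogue. Since you already know how to split between the characters' dialogue, that may work.

In the original text file, are the names always preceded by several spaces? And does the dialogue start at the beginning of a line?

Thanks for your reply. I get this error: ValueError: too many values to unpack (expected 2)

Can you try this:
name, line = item.split(maxsplit=1)

That gives the same error message: ValueError: not enough values to unpack (expected 2, got 1)

That is not the same error... but anyway, I guess your data is not consistently formatted and needs to be cleaned first.

@fishman I have made an edit to show the original script. Could you explain how I would get the same dictionary result from that? Maybe that would be easier.

It gives me an index error: IndexError: list index out of range

This works now, but there is a small bug: a character name followed by a space shows up as a different name.

That worked, thanks! But I just noticed another problem: there are 4 copies of each set of lines, so there are duplicates in it.

@fishman What do you mean by a set of lines? Can you give an example?

Like, it finds all the lines, which is great. But there are four of each line.

When I run the function, it gives me an error: AttributeError: 'list' object has no attribute 'split'

Can you paste an example of the dialogue you pass to the function into pastebin and put a link here?

Ah, you misunderstood. I meant for you to use it on the original problem, i.e. Dialogue[0], which I assume is one character's dialogue. I thought you already had that logic, don't you? So I wanted you to use this on every dialogue in the dialogue list, as you mentioned in the original question.

How would I run this function on every dialogue item?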