Python: add character names and their lines from an array/list to a new dictionary
Tags: python, regex, text, nltk, analysis
I have a movie script. My first task is to collect each character's lines in a dictionary. Later I need to put the data into a Series. Right now I have all of the dialogue in a list, with each item starting with the character's name. It is formatted like this:
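Dialogue[0]
'NAME1\n(16 spaces)Yo, Yo, good that you're here man.'
All names end with \n, and every line of dialogue starts with 16 spaces. I think this could be useful, but I don't know how to take advantage of it.
I've tried a number of approaches, with very little success:
result = {}
for paragraph in dialogue:
    first_token = paragraph.split()[0]
    if first_token.endswith('\n'):  # this would be the name
        name, line = paragraph.split(on new line?)
        name = name.strip()
        if name not in result:
            result[name] = []
        result[name].append(line)
return result
This code gives me a whole bunch of errors, so I don't think it's useful to list them here.
Ideally, I need each character as a key in the dictionary, with all of their lines as the value.
Something like this: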
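NAME1: [line 1, line 2, line 3 ...]
NAME2: [line 1, line 2, line 3 ...]
Edit:
Some character names consist of two words.
Edit 2:
Maybe it would be easier to go back to the original movie script text file.
It is formatted like this: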
NAME1
Yo, Yo, good that you're here
man.
NAME2
(Laughing)
I don't think that's good! We were
at the club, smoking, laughing -- doing
stuff.
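Edited answer: going back to the original file, if we can assume that every character name is preceded by 22 whitespace characters, we can do this: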
# Only the 22-space indent before each name matters here; the indentation
# of the dialogue lines is illustrative.
example = """
                      NAME1
          Yo, Yo, good that you're here
          man.

                      NAME2
                (Laughing)
          I don't think that's good! We were
          at the club, smoking, laughing -- doing
          stuff.
"""

lines = example.split('\n')
# Character-name lines are the ones indented by 22 spaces.
characters = [line for line in lines if line.startswith(' ' * 22)]
result = {c.strip(): [] for c in characters}

current = ''
for line in lines:
    if line in characters:
        # A new character starts speaking.
        current = line.strip()
    elif current:
        # Every other line belongs to the current speaker.
        result[current].append(line.strip())
The result is now:
{'NAME1': ["Yo, Yo, good that you're here", 'man.', ''], 'NAME2': ['(Laughing)', "I don't think that's good! We were", 'at the club, smoking, laughing -- doing', 'stuff.', '']}
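The empty strings in this output come from the blank lines in the script. A minimal sketch of one possible cleanup step, assuming the dictionary has the shape shown above (the helper name `drop_blank_lines` is hypothetical, not part of the original answer):

```python
# Sample input: the dictionary shown in the output above.
result = {
    'NAME1': ["Yo, Yo, good that you're here", 'man.', ''],
    'NAME2': ['(Laughing)', "I don't think that's good! We were",
              'at the club, smoking, laughing -- doing', 'stuff.', ''],
}


def drop_blank_lines(lines_by_character):
    """Return a new dict with empty strings removed from every list."""
    return {
        name: [line for line in lines if line]
        for name, lines in lines_by_character.items()
    }


cleaned = drop_blank_lines(result)
print(cleaned['NAME1'])
```

The helper builds a new dictionary rather than editing `result` in place, so the raw parse stays available for comparison.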
This may need some additional cleanup.
Method 1:
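Split on '\n' and strip each piece. The first element of the list is the name, and the rest are your lines. list.pop modifies the list in place.
This solution will not work if a single line of dialogue itself spans multiple lines.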
>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> lines = list(map(str.strip, dialogue.split('\n')))
>>> lines
['NAME1', 'abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
>>> name = lines.pop(0)
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw', 'sadi waiudi qoweoq asodhoqndoqndqwdq.', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
Method 2:
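If you have multi-line dialogue, i.e. the dialogue may contain '\n' characters, first split on the first occurrence of '\n'. The first element is the name; the remainder is then split further on the 16-space indent.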
>>> dialogue
'NAME1\n                abc adbaiuho saidainbw\n                sadi waiudi qoweoq asodhoqndoqndqwdq.\n                qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!'
>>> parse_temp = dialogue.split('\n',1)
>>> name = parse_temp[0]
>>> lines = parse_temp[1].split(" " * 16)[1:]
>>> name
'NAME1'
>>> lines
['abc adbaiuho saidainbw\n', 'sadi waiudi qoweoq asodhoqndoqndqwdq.\n', 'qiudwqd aisdiqnd asfiqwofnqofoweqomdomkmq!!']
As a function:
def parse(dialogue):
    parse_temp = dialogue.split('\n', 1)
    name = parse_temp[0].strip()
    lines = list(map(str.strip, parse_temp[1].split(" " * 16)[1:]))
    return name, lines
Note: for the second split you can substitute any whitespace pattern. You could even split with a regular expression. I have simply used 16 spaces here.
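As a rough illustration of the regular-expression variant mentioned in the note, here is a sketch that splits on any newline together with its surrounding whitespace, so the indent does not have to be exactly 16 spaces. The function name `parse_with_regex` and the sample string are assumptions made for this example.

```python
import re


def parse_with_regex(dialogue):
    """Split one dialogue item into (name, lines) on newline plus whitespace."""
    # Each newline, with any whitespace around it, acts as one separator.
    parts = re.split(r'\s*\n\s*', dialogue.strip())
    # The first piece is the name; everything after it is dialogue.
    name, *lines = parts
    return name, lines


sample = 'NAME1\n' + ' ' * 16 + 'abc def\n' + ' ' * 16 + 'ghi jkl.'
print(parse_with_regex(sample))
```

Because the separator swallows whitespace on both sides of the newline, a name with a trailing space such as 'NAME1 ' comes out as 'NAME1'.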
Code added on request, to iterate over the whole list:
data = dict()
for _dialogue in dialogue:
    name, lines = parse(_dialogue)
    data[name] = data.get(name, list()) + lines
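The same accumulation could also be written with `collections.defaultdict`, which avoids rebuilding a list on every iteration. This is only a sketch: `parse` is repeated from above so the example runs on its own, and the sample `dialogue` list is made up to match the format described in the question.

```python
from collections import defaultdict


def parse(dialogue):
    # Same approach as the parse function above, repeated for self-containment.
    name, rest = dialogue.split('\n', 1)
    lines = [part.strip() for part in rest.split(' ' * 16)[1:]]
    return name.strip(), lines


# Hypothetical sample data: name, newline, then 16-space-indented lines.
dialogue = [
    'NAME1\n' + ' ' * 16 + 'first line',
    'NAME2\n' + ' ' * 16 + 'second line',
    'NAME1 \n' + ' ' * 16 + 'third line\n' + ' ' * 16 + 'fourth line',
]

data = defaultdict(list)
for item in dialogue:
    name, lines = parse(item)
    # extend() appends to the existing list for this character.
    data[name].extend(lines)

print(dict(data))
```

Since `parse` strips the name, 'NAME1' and 'NAME1 ' end up under the same key.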
- Split the text lines
- Create a dict with a unique key for each actor
- Add each actor's lines to the dict
import re

lines = [
    "Dialogue[0] 'NAME1 \n                YO, YO, good that you're here man.'",
    "Dialogue[1] 'NAME 1\n                YO, YO, ",
    "Dialogue[2] 'NAME2\n                YO, YO, good that ",
    "Dialogue[3] 'NAME2\n                YO, YO, good that you're here'",
]

# Name (capitals, digits, spaces), a newline, 16 spaces, then the dialogue text.
regex = re.compile("'([A-Z 0-9]+)\n[ ]{16}(.+)")

# Keep the first (name, text) match of every line that matches at all.
matches = [regex.findall(line) for line in lines]
lineslist = [match[0] for match in matches if match]

# One key per distinct stripped name, then append each text to its name.
result = {name.strip(): [] for name, _ in lineslist}
for name, text in lineslist:
    result[name.strip()].append(text)

result
Output:
{'NAME 1': ['YO, YO, '],
'NAME1': ["YO, YO, good that you're here man.'"],
'NAME2': ['YO, YO, good that ', "YO, YO, good that you're here'"]}
Use this regular expression to split the data:
'([A-Z 0-9]+)\n[ ]{16}(.+)
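To make the pieces of that pattern easier to read, here is the same expression written with `re.VERBOSE` and applied to one sample string. This is a sketch for illustration; the sample line mirrors the first entry used in the answer's code.

```python
import re

pattern = re.compile(r"""
    '([A-Z 0-9]+)   # opening quote, then the name: capitals, digits, spaces
    \n              # the newline that ends the name
    [ ]{16}         # the 16-space indent before the dialogue
    (.+)            # the rest of that line: the dialogue text
""", re.VERBOSE)

line = "Dialogue[0] 'NAME1 \n" + " " * 16 + "YO, YO, good that you're here man.'"
match = pattern.search(line)
# Strip the name so 'NAME1 ' and 'NAME1' are treated as the same character.
name, text = match.group(1).strip(), match.group(2)
print(name, '|', text)
```

Note that `(.+)` stops at the next newline, so only the first line of a multi-line dialogue is captured, and the closing quote of the sample string stays in the captured text.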
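Before going to the original movie script, please check Method 2 in my answer, since you seem to have multi-line dialogue. Since you already know how to split between the characters' dialogue, this might work.
In the original text file, are names always preceded by several spaces? And does the dialogue start at the beginning of a line?
Thanks for the reply. I got this error: ValueError: too many values to unpack (expected 2)
Can you try this: name, line = item.split(maxsplit=1)
That gave the same error message: ValueError: not enough values to unpack (expected 2, got 1)
That's not the same error... but anyway, I guess your data is not consistently formatted and needs to be cleaned first @fishmanI
I have made an edit to show the original script. Can you explain how I could get the same dictionary result from that? Maybe that would be easier.
It gives me an index error: IndexError: list index out of range
This worked, but there is a small bug: a character name followed by a trailing space shows up as a different name.
That worked, thanks! But I just noticed another problem: there are 4 copies of each group of lines. There are duplicates.
@fishman What do you mean by a group of lines? Can you give an example?
As in, it finds all the lines, which is great. But there are four copies of each of them.
When I run the function, it gives me an error: AttributeError: 'list' object has no attribute 'split'
Can you paste a sample of the dialogue you pass to the function on pastebin and put a link here?
Ah, you misunderstood. I meant for you to use it on the original problem, i.e. dialogue[0], which I assume is one character's dialogue. I thought you already had that logic, no? So I would like you to use this on every dialogue in the dialogue list, as you mentioned in the original question.
How would I run this function on every dialogue item?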
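The two ValueErrors mentioned in the comments are what tuple unpacking raises when str.split returns more or fewer than two pieces. One way to sidestep both, sketched here with a hypothetical helper `split_name`, is str.partition, which always returns exactly three parts:

```python
def split_name(item):
    """Split a dialogue item into (name, body) without unpacking errors."""
    # partition returns (before, separator, after) even when '\n' is absent.
    name, _, body = item.partition('\n')
    return name.strip(), body


print(split_name('NAME1\n' + ' ' * 16 + 'one\n' + ' ' * 16 + 'two'))
print(split_name('NAME1'))
```

An item without any newline simply yields an empty body, which can then be filtered out instead of stopping the loop with an exception.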