Parsing a screenplay with Python

python regex parsing


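Hi, I am trying to parse a screenplay and capture (name): (dialogue) pairs with a regular expression. So far, on regex101, I have

re.compile(r"(\w+)\n(.*)")

but it falls flat on some lines that contain special characters. Thanks for your help. (Text formatting added to aid reproducibility.)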

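I see a couple of problems right away:

You are not taking advantage of the fact that character names are all uppercase. Use `[A-Z]+` instead of `\w+` in your regex.

You are not using the `SingleLine` regex option (`re.DOTALL` in Python), so `.` does not match newlines and the second group cannot span multiple lines.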
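Both points can be seen on a tiny input. The following is a minimal sketch; the `sample` text is made up for illustration and is not the asker's script.

```python
import re

# Hypothetical sample: one action line, then one speaker with two lines.
sample = "Caitlin rolls over\nand groans.\n\nCLAIRE\nLet's go.\nWe'll be late.\n"

# \w+ accepts any word that happens to end a line, not just names.
loose = re.findall(r"(\w+)\n(.*)", sample)

# [A-Z]+ anchored to the start of a line only accepts an all-caps name line,
# but (.*) still stops at the end of the first dialogue line.
strict = re.findall(r"^([A-Z]+)\n(.*)", sample, re.M)

# With re.DOTALL, "." also matches newlines, so (.*) spans several lines.
dotall = re.findall(r"^([A-Z]+)\n(.*)", sample, re.M | re.DOTALL)

print(loose)   # [('over', 'and groans.'), ('CLAIRE', "Let's go.")]
print(strict)  # [('CLAIRE', "Let's go.")]
print(dotall)  # [('CLAIRE', "Let's go.\nWe'll be late.\n")]
```

Note that with `re.DOTALL` a greedy `.*` runs to the end of the text, so on a longer script it would also swallow every later speaker. Some way of stopping at the blank line after each dialogue block is still needed.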

Thanks to @Wiktor Stribiżew's pattern, here is the complete code that produces the desired result:

import re

# `script` is assumed to hold the full screenplay text as a single string.

# Grouped regex pattern to capture character and dialogue in a tuple
char_dialogue = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")
extract_dialogue = char_dialogue.findall(script)

final_dict = {}

for element in extract_dialogue:
    # Separating the character and dialogue from the tuple
    char = element[0]
    line = element[1]
    # If the char is already a key in the dictionary
    # and line is not empty, append the dialogue to the value list
    if char in final_dict:
        if line != '':
            final_dict[char].append(line)
    else:
        # Else add the character name to the dictionary keys with their first line
        # Drop any lowercase matches in the name group
        # Can adjust the len here if you have characters with fewer letters
        if char.isupper() and len(char) > 2:
            final_dict[char] = [line]

# Some final cleaning to drop empty dialogue

final_dict = {k: v for k, v in final_dict.items() if v != ['']}

# More filtering to return only main characters with more than 50
# blocks of dialogue

final_dict = {k: v for k, v in final_dict.items() if len(v) > 50}
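To try the approach on a small input, the loop can be wrapped in a function. This is only a sketch: the names `parse_dialogue`, `min_name_len` and `min_blocks` are hypothetical, the sample text is invented, and the filtering is a simplified variant of the steps above rather than an exact copy.

```python
import re
from collections import defaultdict

# Same pattern as in the answer above.
CHAR_DIALOGUE = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")


def parse_dialogue(script, min_name_len=3, min_blocks=1):
    """Map each character name to a list of their dialogue blocks.

    parse_dialogue, min_name_len and min_blocks are hypothetical names
    chosen for this sketch; they do not come from the original answer.
    """
    by_char = defaultdict(list)
    for name, block in CHAR_DIALOGUE.findall(script):
        # Skip very short all-caps words and empty dialogue blocks.
        if len(name) >= min_name_len and block.strip():
            by_char[name].append(block)
    # Keep only characters with at least min_blocks dialogue blocks.
    return {k: v for k, v in by_char.items() if len(v) >= min_blocks}


# Invented sample, not the asker's script.
sample = (
    "CLAIRE\n      Morning, beauty.\n\n"
    "Caitlin grunts.\n\n"
    "CAITLIN\n           (muffled)\n      I'm totally ready.\n\n"
    "CLAIRE\n      Let's go.\n"
)
result = parse_dialogue(sample)
print(result)
```

Passing `min_blocks=51` would roughly correspond to the "more than 50" filter in the code above.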

You can use

(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)

Details:

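`(?m)^` - start of a line (`(?m)` is the same as the `re.M` option)
`\s*` - zero or more whitespace characters
`\b([A-Z]+)\b` - Group 1: an uppercase whole word (`\b` is a word boundary)
`\s*` - zero or more whitespace characters
`\n` - a newline
`(.*(?:\n.+)*)` - Group 2: the rest of the line, then zero or more sequences of a newline followed by a non-empty line (so, any text up to the first blank line)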
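The same breakdown can be kept next to the pattern itself with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. A minimal sketch, with a made-up `sample` string:

```python
import re

# The same pattern, spelled out so each piece carries its own comment.
pattern = re.compile(
    r"""
    ^\s*            # start of a line, then optional whitespace
    \b([A-Z]+)\b    # group 1: an all-uppercase whole word (the name)
    \s*\n           # optional trailing whitespace, then a newline
    (               # group 2: the dialogue block
      .*            #   the rest of the first dialogue line
      (?:\n.+)*     #   any following non-empty lines
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Hypothetical sample text, made up for this sketch.
sample = "CLAIRE\nCome on,\nlet's go.\n\nShe leaves.\n"
match = pattern.search(sample)
name, block = match.group(1), match.group(2)
print(name)         # CLAIRE
print(repr(block))  # "Come on,\nlet's go."
```

The action line "She leaves." is not captured: `[A-Z]+` matches only the `S`, and the `\b` that follows fails because the next character is a lowercase letter.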

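See:

import re
rx = r"^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)"
text = "CLAIRE\n      Morning, beauty.\n\nCaitlin grunts and rolls onto her stomach.\n\nCLAIRE\n      Let's go.  Or we'll never leave on time.\n\nCaitlin's voice comes from the pillow.\n\nCAITLIN\n           (muffled)\n      I'm totally ready.\n\nClaire looks around at the piles of unpacked clothes.\n\nCLAIRE\n      Come on, I'll make you some waffles,\n      maybe we'll squeeze in a trip to the\n      mall.\n           (beat)\n      Caitlin..."
print(re.findall(rx, text, re.M))

Output: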

[
  ('CLAIRE', '      Morning, beauty.'),
  ('CLAIRE', "      Let's go.  Or we'll never leave on time."),
  ('CAITLIN', "           (muffled)\n      I'm totally ready."),
  ('CLAIRE', "      Come on, I'll make you some waffles,\n      maybe we'll squeeze in a trip to the\n      mall.\n           (beat)\n      Caitlin...")
]
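The captured blocks keep the screenplay's indentation and parentheticals such as `(muffled)`. If a flat string per block is preferred, one possible post-processing step is sketched below. The helper name `tidy` is hypothetical, and dropping parenthetical-only lines is an assumption about what is wanted, not something the original answer does.

```python
import re


def tidy(block):
    """Collapse an indented, multi-line dialogue block into one line.

    `tidy` is a hypothetical helper name, not part of the original answer.
    """
    # Strip each line's indentation, drop parenthetical-only lines such as
    # "(muffled)", and join what is left with single spaces.
    lines = (line.strip() for line in block.splitlines())
    spoken = [ln for ln in lines if ln and not re.fullmatch(r"\(.*\)", ln)]
    return " ".join(spoken)


block = "           (muffled)\n      I'm totally ready."
cleaned = tidy(block)
print(cleaned)  # I'm totally ready.
```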

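Yes, I already had [A-Z]+ in place (still tinkering with it), and I will look into the SingleLine regex option, which I had never heard of before. Thanks for the insight.

You haven't explained your expected output. Does

(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)

work for you?

Thanks, @WiktorStribiżew, that is perfect: it captures the name in group 1 and all of the dialogue in group 2. My desired output is a dictionary whose keys are group 1 and whose values are group 2. I had written code to do this with the pattern from the OP, and with your pattern it now works perfectly. Thank you very much. I will post the full answer in case anyone else in need stumbles upon this.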
