Unable to scrape the conversation among debaters into a dictionary

python python-3.x web-scraping

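I've created a script to fetch all the conversation among the different debaters, excluding the moderators. What I've written so far fetches the whole conversation. However, I'd like to collect it in a form like

{speaker_name: (first speech, second speech), ...}

There is another link similar to the one above.

This is what I've tried so far: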

import requests
from bs4 import BeautifulSoup

url = 'https://www.presidency.ucsb.edu/documents/presidential-debate-the-university-nevada-las-vegas'

def get_links(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select(".field-docs-content p:has( > strong:contains('MODERATOR:')) ~ p"):
        print(item.text)

if __name__ == '__main__':
    get_links(url)

How can I scrape the conversation among the debaters and put it into a dictionary?

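Given the differences between the two pages I looked at, and the number of assumptions I had to make, I don't expect this to hold up across many pages. Essentially, I use a regex on the text of the participants and moderator nodes to isolate the lists of moderators and participants. I then loop over all the speech paragraphs. Each time I encounter a moderator at the start of a paragraph, I set a boolean variable store_paragraph = False and ignore the subsequent paragraphs; likewise, each time I encounter a participant, I set store_paragraph = True and store that paragraph and the subsequent ones under the matching participant key in my speaker_dict. I store each speaker_dict in the final results dictionary, keyed by the index of the link.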

import requests, re
from bs4 import BeautifulSoup as bs
import pprint

links = ['https://www.presidency.ucsb.edu/documents/presidential-debate-the-university-nevada-las-vegas','https://www.presidency.ucsb.edu/documents/republican-presidential-candidates-debate-manchester-new-hampshire-0']
results = {}
p = re.compile(r'\b(\w+)\b\s+\(|\b(\w+)\b,')

with requests.Session() as s:
    for number, link in enumerate(links):
        r = s.get(link)
        soup = bs(r.content,'lxml')
        participants_tag = soup.select_one('p:has(strong:contains("PARTICIPANTS:"))')

        if participants_tag.select_one('strong'):
            participants_tag.strong.decompose()
        speaker_dict = {i[0].upper() + ':' if i[0] else i[1].upper() + ':': [] for string in participants_tag.stripped_strings for i in p.findall(string)}
        # print(speaker_dict)
        moderator_data = [string for string in soup.select_one('p:has(strong:contains("MODERATOR:","MODERATORS:"))').stripped_strings][1:]
        #print(moderator_data)
        moderators = [i[0].upper() + ':' if i[0] else i[1].upper() + ':' for string in moderator_data for i in p.findall(string)]
        store_paragraph = False

        for paragraph in soup.select('.field-docs-content p:not(p:contains("PARTICIPANTS:","MODERATOR:"))')[1:]:
            string_to_compare = paragraph.text.split(':')[0] + ':'
            string_to_compare = string_to_compare.upper()
            if string_to_compare in moderators:
                store_paragraph = False
            elif string_to_compare in speaker_dict:
                speaker = string_to_compare
                store_paragraph = True
            if store_paragraph:
                speaker_dict[speaker].append(paragraph.text)
        results[number] = speaker_dict

pprint.pprint(results[1])
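
To try the same idea without hitting the network, here is a minimal offline sketch of the two steps described above: building the 'NAME:' keys with the regex, then toggling store_paragraph while walking the paragraphs. The helper names extract_keys and group_by_speaker, and all of the sample strings, are hypothetical and only for illustration; they are not taken from the real pages.

```python
import re

# Same pattern as in the answer: a word followed by " (" or by ",".
NAME_PATTERN = re.compile(r'\b(\w+)\b\s+\(|\b(\w+)\b,')


def extract_keys(text):
    """Return upper-cased 'NAME:' keys for every name the pattern finds."""
    return [(a or b).upper() + ':' for a, b in NAME_PATTERN.findall(text)]


def group_by_speaker(paragraphs, participants, moderators):
    """Collect participant paragraphs; skip everything a moderator says."""
    speaker_dict = {key: [] for key in participants}
    speaker = None
    store_paragraph = False
    for text in paragraphs:
        # Text before the first colon, normalised to look like a key.
        prefix = text.split(':')[0].upper() + ':'
        if prefix in moderators:
            store_paragraph = False
        elif prefix in speaker_dict:
            speaker = prefix
            store_paragraph = True
        # Paragraphs without a known prefix inherit the current state.
        if store_paragraph:
            speaker_dict[speaker].append(text)
    return speaker_dict


# Hypothetical sample data standing in for the scraped page text.
participants = extract_keys('Senator Jane Brown (X) and Governor John Smith (Y)')
moderators = extract_keys('Pat Lee, Example News')
paragraphs = [
    'LEE: Welcome to the debate.',
    'BROWN: Good evening.',
    'Thank you for having me.',
    'LEE: Governor Smith?',
    'This moderator follow-up is skipped.',
    'SMITH: Thank you.',
    'BROWN: One more point.',
]
grouped = group_by_speaker(paragraphs, participants, moderators)
```

Note that, as in the script above, each stored paragraph keeps its 'NAME:' prefix, and a paragraph with no recognised prefix is attributed to whoever spoke last.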

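Can you share the expected output for the real data? Do you want the key to be the speaker's name and the value to be a list of everything they said? For example:

{'Brown': ['Good Night', 'how are']}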

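OK, I've edited the question to clarify what the expected output might look like. Thanks.

How did I miss this answer? I'll let you know once I've tested it. Thanks for your time, @QHarr.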