Python: scraping a Wikipedia infobox when table cells are in mixed formats


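I'm trying to get the information for certain keywords from Wikipedia infoboxes. For example:

Suppose I'm looking for the value of Manufacturer. I want the values in a list, and I want only their text. So in this case the desired output would be

['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']

No matter what I try, I can't generate this list. Here is a piece of my code: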

url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text

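I'd like a method that works in a variety of cases: when values are separated by line breaks, when some values are links, when some values are paragraphs, and so on. In every case I want only the text we see on screen: no links, no paragraphs, just plain text. I also don't want the output to be

Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada)

because later I want to be able to parse the result and process each entity separately.

I'm going through a lot of Wikipedia pages, and I can't find a method that handles most of them well. Could you help me with working code? I'm not experienced with scraping.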

OK, here is my attempt (the json library is only there to pretty-print the dictionary):
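A minimal sketch of this kind of attempt, under stated assumptions: the names parse_infobox and SAMPLE_HTML are hypothetical, the sample markup is a hand-written stand-in for a real infobox, and html.parser is used in place of lxml so that no extra parser is required. It swaps each line-break tag for a newline before reading the cell text, so values that appear on separate lines on screen stay separated.

```python
import json

from bs4 import BeautifulSoup

# Hand-written stand-in for a Wikipedia infobox (assumption, not real page HTML).
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Type</th><td><a href="#">Subsidiary</a><br>Limited liability company</td></tr>
  <tr><th>Industry</th><td>Television production</td></tr>
</table>
"""


def parse_infobox(html):
    """Return {header text: cell text} for the first infobox table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info
    # Turn every line break into a newline so multi-line cells stay split.
    for br in table.find_all("br"):
        br.replace_with("\n")
    for tr in table.find_all("tr"):
        th = tr.find("th")
        td = tr.find("td")
        if th is None or td is None:
            continue  # skip rows that are not a header/value pair
        info[th.get_text(strip=True)] = td.get_text().strip()
    return info


print(json.dumps(parse_infobox(SAMPLE_HTML), indent=1))
```

For a live page, the string passed to parse_infobox would be the text of the HTTP response rather than the sample markup.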

The code replaces <br> tags with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}

{
 "Trading name": "ABC Studios",
 "Type": [
  "Subsidiary",
  "Limited liability company"
 ],
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": [
  "ABC Entertainment Group",
  "(Disney\u2013ABC Television Group)"
 ],
 "Website": "abcstudios.go.com"
}

If you want to return lists instead of strings containing \n, you can adjust it:

    innerTextList = innerText.split("\n")
    if len(innerTextList) < 2:
        info[th.text] = innerTextList[0]
    else:
        info[th.text] = innerTextList
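The same adjustment can be sketched as a small standalone helper. The name split_cell_text is hypothetical, and dropping empty pieces is an addition that the snippet above does not make; it is there on the assumption that a cell may start or end with a line break.

```python
def split_cell_text(inner_text):
    """Return a single string for one-line cells, a list for multi-line cells."""
    # Split on the newlines that replaced the line-break tags.
    parts = [part.strip() for part in inner_text.split("\n")]
    # Assumption: empty pieces (leading/trailing breaks) carry no information.
    parts = [part for part in parts if part]
    if len(parts) < 2:
        return parts[0] if parts else ""
    return parts


print(split_cell_text("Subsidiary\nLimited liability company"))
print(split_cell_text("Worldwide"))
```

Returning a string in one case and a list in the other mirrors the snippet above; always returning a list would make later processing more uniform, at the cost of one-element lists.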

This code will not work:

soup = BeautifulSoup(requests.get(url), "lxml")

BeautifulSoup needs the content of the requests response, not the response object itself; append .text or .content.
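As an offline illustration of that point: BeautifulSoup accepts markup as a str or as bytes, which is what .text and .content provide respectively. The variable names and the sample markup below are made up for the example, and html.parser is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Stand-ins for response.text (str) and response.content (bytes).
markup_text = "<table class='infobox'><tr><th>Type</th><td>Subsidiary</td></tr></table>"
markup_bytes = markup_text.encode("utf-8")

soup_from_text = BeautifulSoup(markup_text, "html.parser")
soup_from_bytes = BeautifulSoup(markup_bytes, "html.parser")

# Both inputs parse to the same table cells.
print(soup_from_text.th.get_text(), soup_from_text.td.get_text())
print(soup_from_bytes.th.get_text(), soup_from_bytes.td.get_text())
```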

To get the expected output, you need to select the a elements inside td[class="brand"] and then use .next_sibling.string:

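import requests
from bs4 import BeautifulSoup

# url is the Wikipedia page address, defined as in the question
html = requests.get(url).text  # pass the response text, not the response object
soup = BeautifulSoup(html, 'lxml')
# every link inside the infobox cell whose class is exactly "brand"
result = soup.select('td[class="brand"] a')
# link text plus the plain text that directly follows each link
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']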
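The list comprehension above assumes every link is followed by a text node. A slightly more defensive sketch, run against hand-written sample markup rather than the live page, might look like the following; links_with_trailing_text and SAMPLE_ROW are hypothetical names, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup, NavigableString

# Hand-written stand-in for the Manufacturer row (assumption, not real page HTML).
SAMPLE_ROW = """
<table class="infobox">
  <tr><th>Manufacturer</th>
      <td class="brand"><a href="#">Keurig Dr Pepper</a> (United States, Worldwide)<br><a href="#">A&amp;W Canada</a> (Canada)</td></tr>
</table>
"""


def links_with_trailing_text(html, selector='td[class="brand"] a'):
    """Return each selected link's text joined with the text right after it."""
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for a in soup.select(selector):
        sibling = a.next_sibling
        # Only use the sibling when it is plain text; it may be None or a tag.
        trailing = str(sibling) if isinstance(sibling, NavigableString) else ""
        values.append((a.get_text() + trailing).strip())
    return values


print(links_with_trailing_text(SAMPLE_ROW))
```

The isinstance check keeps the helper from raising when a link is the last node in its cell or is followed directly by another tag.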

Your answer is close to what I want. Thank you.
