Python: Scraping a Wikipedia infobox when table cells have mixed formatting (Python, Web Scraping, BeautifulSoup, Wikipedia) - Fatal编程技术网

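I am trying to extract the values of certain keys from Wikipedia infoboxes. For example:

Suppose I am looking for the value of Manufacturer. I want the values in a list, and I only want their text. In this case, the desired output would be
['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']
No matter what I try, I cannot produce this list. Here is a piece of my code: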

url = "https://en.wikipedia.org/wiki/ABC_Studios"
soup = BeautifulSoup(requests.get(url), "lxml")
tbl = soup.find("table", {"class": "infobox vcard"})
list_of_table_rows = tbl.findAll('tr')
for tr in list_of_table_rows:
    th = tr.find("th")
    td = tr.find("td")
    # take th.text and td.text
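I would like a method that works across cases: when values are separated by line breaks, when some values are links, when some values are paragraphs, and so on. In every case I only want the text we see on screen: not links, not paragraphs, just plain text. I also do not want the output to be
Keurig Dr Pepper (United States, Worldwide)A&W Canada (Canada)
because later I want to be able to parse the result and process each entity separately.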


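I am going through many Wikipedia pages, and I cannot find an approach that handles most of them well. Could you help me with working code? I am not experienced with scraping.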

OK, here is my attempt (the json library is only used to pretty-print the dictionary):
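A minimal sketch of how such an attempt could look, not necessarily the answerer's exact code. It assumes the infobox is a table carrying the class infobox, parses a small hypothetical HTML sample (SAMPLE_HTML) instead of fetching a live page, and uses the built-in html.parser so that lxml is not required. The helper name infobox_to_dict is illustrative.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down infobox markup standing in for a real page.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Type</th><td><a href="/wiki/Subsidiary">Subsidiary</a><br/>Limited liability company</td></tr>
  <tr><th>Industry</th><td>Television production</td></tr>
  <tr><td colspan="2">row without a header cell</td></tr>
</table>
"""


def infobox_to_dict(html):
    """Map each infobox header cell to the visible text of its value cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    info = {}
    if table is None:
        return info

    # Turn every <br> into a newline so separate values stay separable.
    for br in table.find_all("br"):
        br.replace_with("\n")

    for tr in table.find_all("tr"):
        th = tr.find("th")
        td = tr.find("td")
        if th is None or td is None:
            continue  # skip image rows, section headings, and similar
        # get_text() drops all tags (links, paragraphs) and keeps the text.
        info[th.get_text(strip=True)] = td.get_text().strip()
    return info


info = infobox_to_dict(SAMPLE_HTML)
print(json.dumps(info, indent=1))
```

On this sample the dictionary has only the two entries Type and Industry; for a live page, the HTML would come from requests.get(url).text.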

The code replaces <br> tags with \n, which gives:

{
 "Trading name": "ABC Studios",
 "Type": "Subsidiary\nLimited liability company",
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": "ABC Entertainment Group\n(Disney\u2013ABC Television Group)",
 "Website": "abcstudios.go.com"
}
If you want lists instead of strings containing \n, you can adjust it:

    innerTextList = innerText.split("\n")
    if len(innerTextList) < 2:
        info[th.text] = innerTextList[0]
    else:
        info[th.text] = innerTextList

which gives:

{
 "Trading name": "ABC Studios",
 "Type": [
  "Subsidiary",
  "Limited liability company"
 ],
 "Industry": "Television production",
 "Predecessor": "Touchstone Television",
 "Founded": "March\u00a021, 1985; 33 years ago(1985-03-21)",
 "Headquarters": "Burbank, California,U.S.",
 "Area served": "Worldwide",
 "Key people": "Patrick Moran (President)",
 "Parent": [
  "ABC Entertainment Group",
  "(Disney\u2013ABC Television Group)"
 ],
 "Website": "abcstudios.go.com"
}
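The splitting step can also be wrapped in a small standalone helper. The sketch below is one possible variant, not part of the original answer: the name split_cell_text is hypothetical, and unlike the fragment above it also strips whitespace and drops empty pieces, which is an added assumption about what is wanted.

```python
def split_cell_text(text):
    """Return a single string for one value, or a list for several values."""
    # Split on the newlines that replaced <br> tags and discard empty pieces.
    parts = [part.strip() for part in text.split("\n")]
    parts = [part for part in parts if part]
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0]
    return parts


print(split_cell_text("Subsidiary\nLimited liability company"))
print(split_cell_text("Worldwide"))
```

Returning either a string or a list mirrors the adjustment above; always returning a list would be simpler for later processing, at the cost of one-element lists for most fields.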

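This code will not work:

soup = BeautifulSoup(requests.get(url), "lxml")

BeautifulSoup needs the content of the response, not the Response object itself; append .text or .content to the requests.get(url) call.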

To get the expected Manufacturer result, select the a elements inside td[class="brand"] and then use .next_sibling.string:

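import requests
from bs4 import BeautifulSoup

# url: the Wikipedia page being scraped (not defined in this snippet)
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
# every link inside the "brand" cell; the text in parentheses follows each link
result = soup.select('td[class="brand"] a')
manufacturer = [a.text + a.next_sibling.string for a in result]
print(manufacturer)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)']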
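Note that a.next_sibling.string is None when a link is followed by another tag, and a.next_sibling itself is None when the link is the last node in the cell, so the concatenation above can raise an error on other pages. A hedged sketch of a more defensive variant follows. It runs on a hypothetical inline sample (SAMPLE_HTML, including an invented third entry "Example Bottler" with no trailing text) and uses html.parser; the helper name link_with_trailing_text is illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup imitating a "brand" cell; the third link is invented
# to show a link with no text after it.
SAMPLE_HTML = (
    '<table class="infobox"><tr><th>Manufacturer</th>'
    '<td class="brand">'
    '<a href="#">Keurig Dr Pepper</a> (United States, Worldwide)<br/>'
    '<a href="#">A&amp;W Canada</a> (Canada)<br/>'
    '<a href="#">Example Bottler</a>'
    '</td></tr></table>'
)


def link_with_trailing_text(a):
    """Join a link's text with the plain text that directly follows it, if any."""
    sibling = a.next_sibling
    # Only plain text nodes are appended; tags such as <br> or a missing
    # sibling contribute nothing.
    suffix = str(sibling) if isinstance(sibling, NavigableString) else ""
    return (a.get_text() + suffix).strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
manufacturers = [
    link_with_trailing_text(a) for a in soup.select('td[class="brand"] a')
]
print(manufacturers)
# ['Keurig Dr Pepper (United States, Worldwide)', 'A&W Canada (Canada)', 'Example Bottler']
```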

Your answer is close to what I am looking for. Thank you.