Python: getting a column from a Wikipedia table using BeautifulSoup

Tags: python, python-3.x, beautifulsoup, html-parsing

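I'm trying to get a list of song names from the "List of singles" table.

The table has no unique class or id. The only unique thing I can think of is the caption tag around "List of singles...":

List of singles as main artist, with selected chart positions, sales figures and certifications

I tried:

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")

table = soup.find_all("caption")

But it returns nothing. I assume caption is not a tag that bs4 recognizes?

Here is a complete example that solves the "Taylor Swift problem". First find the caption that contains the text "List of singles", then move to its parent. Then iterate over the items that contain the text you are looking for:

# Find the caption that contains the text "List of singles"
for caption in soup.findAll("caption"):
    if "List of singles" in caption.text:
        break

# The caption's parent is the table itself
table = caption.parent
# Row headers (th with scope="row") hold the song titles
for item in table.findAll("th", {"scope": "row"}):
    print(item.text)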
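
The caption-then-parent idea can be tried without network access. The following is a minimal sketch, assuming a small hypothetical HTML fragment (`SAMPLE_HTML`) and a hypothetical helper name (`titles_from_captioned_table`), neither of which comes from the thread. It also returns an empty list when no caption matches, whereas a `for`/`break` loop would leave `caption` bound to the last caption on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia page: two tables that differ only by caption.
SAMPLE_HTML = """
<table>
  <caption>List of albums</caption>
  <tr><th scope="row">Fearless</th><td>2008</td></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def titles_from_captioned_table(html, caption_text):
    """Return row-header texts of the first table whose caption contains caption_text."""
    soup = BeautifulSoup(html, "html.parser")
    for caption in soup.find_all("caption"):
        if caption_text in caption.get_text():
            # Walk up from the caption to the table that owns it.
            table = caption.find_parent("table")
            return [th.get_text(strip=True)
                    for th in table.find_all("th", scope="row")]
    return []  # no caption matched


singles = titles_from_captioned_table(SAMPLE_HTML, "List of singles")
print(singles)
```

Filtering on `scope="row"` is what skips the column headers, which use `scope="col"` in this sample.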

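On the Wikipedia page, the caption-based loop gives:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...

Actually, it has nothing to do with findAll() versus find_all(). findAll() was used in BeautifulSoup3 and was left in BeautifulSoup4 for compatibility reasons, as the bs4 source code shows: there, findAll is kept as an alias of find_all.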
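
The claim that `findAll` and `find_all` behave the same in bs4 can be checked directly. This is a small sketch on a hypothetical two-row fragment (not taken from the Wikipedia page); it assumes an installed bs4 that still accepts the old spelling, and some newer releases may flag it as deprecated.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to compare the two spellings.
html = ('<table><tr><th scope="row">A</th></tr>'
        '<tr><th scope="row">B</th></tr></table>')
soup = BeautifulSoup(html, "html.parser")

new_style = soup.find_all("th", {"scope": "row"})  # bs4 spelling
old_style = soup.findAll("th", {"scope": "row"})   # BeautifulSoup 3 spelling, kept in bs4

# Compare the extracted texts from both calls.
same_texts = [th.text for th in new_style] == [th.text for th in old_style]
print(same_texts)
```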

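There is also a better way to get the list of singles. It relies on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span tag's parent, then get all th elements with scope="row":

from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content, 'html.parser')

# The span with id="Singles" sits inside the section heading;
# the singles table is the first table sibling after that heading.
table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)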
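
To see the span-to-sibling traversal in isolation, here is a minimal offline sketch. `SAMPLE_HTML` is a hypothetical fragment that only imitates the layout the answer relies on (a heading wrapping `<span id="Singles">`, followed by the table as a later sibling). The `None` checks are an addition: chaining `.parent` directly onto `find()` raises `AttributeError` when the span is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each heading wraps a span with an id,
# and the matching table follows the heading as a sibling.
SAMPLE_HTML = """
<div>
<h2><span id="Albums">Albums</span></h2>
<table><tr><th scope="row">Fearless</th></tr></table>
<h2><span id="Singles">Singles</span></h2>
<p>Intro text between the heading and the table.</p>
<table>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Love Story"</th></tr>
  <tr><th scope="row">"Mine"</th></tr>
</table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the section marker, then the first table after its parent heading.
anchor = soup.find("span", id="Singles")
table = anchor.parent.find_next_sibling("table") if anchor else None

# Collect row headers only; fall back to an empty list if nothing was found.
singles = ([th.get_text(strip=True) for th in table.find_all("th", scope="row")]
           if table else [])
print(singles)
```

`find_next_sibling("table")` skips the intervening paragraph and whitespace, so the table does not have to follow the heading immediately.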

On the Wikipedia page, the span-based code prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"

Thanks a lot! I thought findAll and find_all did the same thing, because of this.
@user2985522 I think it depends on the version you're using. If you're using the older BeautifulSoup, it's find_all, but if you're using bs4 (which you should be), it's findAll.
@Hooked: Isn't that backwards? I'm pretty sure findAll is the old one.
@DSM, that's right. I'll update my answer accordingly.