Python: getting a column from a Wikipedia table using BeautifulSoup


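I'm trying to get a list of song names from the "List of singles" table.

The table doesn't have a unique class or id. The only unique thing I can think of is the caption tag around "List of singles...":

List of singles as main artist, with selected chart positions, sales figures and certifications

I tried: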

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")

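But it returns nothing. I assume caption isn't a tag that bs4 recognizes?

Here is a complete example that works on the "Taylor Swift problem". First look for the caption that contains the text "List of singles", then move to its parent, which is the table. Then iterate over the row headers, which hold the text you are looking for:

# `soup` is the BeautifulSoup object built in the question.
# Find every <caption> tag and stop at the one that labels the singles table.
for caption in soup.find_all("caption"):
    if "List of singles" in caption.text:
        break

# The caption's parent is the <table>; each <th scope="row"> holds a song title.
table = caption.parent
for item in table.find_all("th", {"scope": "row"}):
    print(item.text)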
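One caveat with a for/break search like this: if no caption matches, the loop variable is left pointing at the last caption on the page (or is undefined if there are none). Below is a minimal sketch of a safer variant, run against a small inline HTML sample rather than the live page. The helper names `find_table_by_caption` and `row_headers` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia page (hypothetical markup, not the real page)
SAMPLE_HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">Fearless</th><td>2008</td></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def find_table_by_caption(soup, text):
    """Return the first <table> whose <caption> contains `text`, or None."""
    for caption in soup.find_all("caption"):
        if text in caption.get_text():
            return caption.find_parent("table")
    return None


def row_headers(table):
    """Return the stripped text of every <th scope="row"> in `table`."""
    return [th.get_text(strip=True) for th in table.find_all("th", scope="row")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
singles = find_table_by_caption(soup, "List of singles")
titles = row_headers(singles) if singles is not None else []
print(titles)
```

Returning None when nothing matches lets the caller decide what to do, instead of silently reading the wrong table.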

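Run against the Wikipedia page, the caption loop prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...

Actually, it has nothing to do with findAll() versus find_all(). findAll() was the name used in BeautifulSoup3, and it was left in BeautifulSoup4 for compatibility reasons: in the bs4 source code, findAll is kept as an alias for find_all.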
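On findAll() versus find_all(): a quick way to check that the two names behave the same under bs4 is to call both on the same document. This is only a sketch on a made-up HTML snippet; the warning filter is there on the assumption that some newer bs4 releases warn about the BS3-era name.

```python
import warnings

from bs4 import BeautifulSoup

# Made-up snippet, just enough markup to have something to search
html = "<table><tr><th>A</th><th>B</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

new_style = soup.find_all("th")
with warnings.catch_warnings():
    # Newer bs4 releases may emit a DeprecationWarning for the old name
    warnings.simplefilter("ignore")
    old_style = soup.findAll("th")

# Both calls should return the same tags in the same order
same = new_style == old_style
print(same)
```

Under the old BeautifulSoup 3 package only findAll exists, so code written for bs4 should prefer find_all.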

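There is also a better way to get the list of singles. It relies on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span tag's parent, then get all th elements with scope="row":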

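from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(source_code.content, 'html.parser')

# The span with id="Singles" sits inside the section heading; the singles
# table is the first table that follows that heading
table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)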
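The same span-then-sibling idea can be tried offline. The sketch below uses a small made-up HTML sample in place of the live page, whose markup may have changed since this was written, and it guards against the span or the table being missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a heading followed by its table
SAMPLE_HTML = """
<h2><span id="Albums">Albums</span></h2>
<table><tr><th scope="row">Fearless</th></tr></table>
<h2><span id="Singles">Singles</span></h2>
<p>Some introductory text.</p>
<table>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Love Story"</th></tr>
  <tr><th scope="row">"Mine"</th></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the section marker, then the first <table> after its parent heading
span = soup.find("span", id="Singles")
table = span.parent.find_next_sibling("table") if span is not None else None

# Collect the row headers; an empty list means the section was not found
singles = [] if table is None else [
    th.get_text(strip=True) for th in table.find_all("th", scope="row")
]
print(singles)
```

find_next_sibling("table") skips the paragraph between the heading and the table, which is why the intervening text does not get in the way.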

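Thanks a lot! I thought findAll and find_all did the same thing because of this.

@user2985522 I think it depends on which version you are using. If you're using the older BeautifulSoup it's find_all, but if you're using bs4 (which you should be) it's findAll.

@Hooked: Isn't that backwards? I'm pretty sure findAll is the old one.

@DSM, you're right. I'll update my answer accordingly.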

The find_next_sibling() version prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"