Python: using BeautifulSoup to get a column from a Wikipedia table
I am trying to get a list of song names from the "List of singles" table. The table has no unique class or id. The only unique thing I can think of is the caption tag around "List of singles...":

List of singles as lead artist, with selected chart positions, sales figures and certifications

I tried:
source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")
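table = soup.find_all("caption")

But it returns nothing. I assume caption is not a tag that bs4 recognizes?

Here is a complete example that solves the "Taylor Swift problem". First find the caption that contains the text "List of singles", then move up to its parent, the table. Then iterate over the row headers, which hold the text you are looking for:

# Stop at the first caption that mentions "List of singles".
for caption in soup.findAll("caption"):
    if "List of singles" in caption.text:
        break
# The caption's parent is the table itself.
table = caption.parent
# Row headers (th with scope="row") hold the song titles.
for item in table.findAll("th", {"scope": "row"}):
    print(item.text)

This gives:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...

Actually, this has nothing to do with findAll(). findAll() is the name used in BeautifulSoup 3; it was left in BeautifulSoup 4 for compatibility reasons, where the bs4 source code keeps it as a backward-compatible alias of find_all().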
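The caption-based lookup and the findAll()/find_all() point above can be tried without a network request. The following is only a sketch under assumptions: SAMPLE_HTML is made-up markup standing in for the Wikipedia page, and singles_from_caption is a hypothetical helper name, not something from the thread. It returns an empty list when no caption matches, so a miss does not silently fall through to the wrong table. Recent bs4 releases may emit a deprecation warning for findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two captioned tables,
# only one of which is the "List of singles" table.
SAMPLE_HTML = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">Fearless</th><td>2008</td></tr>
</table>
<table>
  <caption>List of singles as lead artist</caption>
  <tr><th scope="col">Title</th><th scope="col">Year</th></tr>
  <tr><th scope="row">"Tim McGraw"</th><td>2006</td></tr>
  <tr><th scope="row">"Our Song"</th><td>2007</td></tr>
</table>
"""


def singles_from_caption(html, needle="List of singles"):
    """Return row-header texts from the first table whose caption contains needle."""
    soup = BeautifulSoup(html, "html.parser")
    for caption in soup.find_all("caption"):
        if needle in caption.get_text():
            table = caption.find_parent("table")
            return [th.get_text(strip=True)
                    for th in table.find_all("th", scope="row")]
    return []  # no caption matched


titles = singles_from_caption(SAMPLE_HTML)
print(titles)

# findAll is the BeautifulSoup 3 spelling; both names return the same tags.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
same_results = soup.findAll("caption") == soup.find_all("caption")
print(same_results)
```

If both spellings return the same captions here, an empty result on the live page is more likely caused by the markup or the parser than by the method name.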
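There is also a better way to get the list of singles. It relies on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span tag's parent, then get all th elements with scope="row":

from bs4 import BeautifulSoup
import requests

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content)
# The span with id="Singles" sits inside the section heading;
# the singles table is the first table sibling after that heading.
table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)

This prints the title of every single; the full output is listed at the end of this page.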
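The id-based lookup can also be tried offline. The following is only a sketch under assumptions: SAMPLE_HTML is made-up markup imitating a heading span followed by a table, and singles_after_heading is a hypothetical helper name. Wikipedia's markup changes over time, so the span with id="Singles" may not be present on the live page; the helper returns an empty list instead of raising when the anchor or the table is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout the answer relies on:
# a heading containing <span id="Singles">, followed later by the table.
SAMPLE_HTML = """
<h2><span id="Albums">Albums</span></h2>
<table><tr><th scope="row">Fearless</th></tr></table>
<h2><span id="Singles">Singles</span></h2>
<p>Intro text between the heading and the table.</p>
<table>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Love Story"</th></tr>
  <tr><th scope="row">"Safe &amp; Sound"
(featuring The Civil Wars)</th></tr>
</table>
"""


def singles_after_heading(html, heading_id="Singles"):
    """Return row-header texts from the first table after the heading anchor."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find(id=heading_id)
    if anchor is None:
        return []  # the page has no element with this id
    # Skip any siblings between the heading and the first table.
    table = anchor.parent.find_next_sibling("table")
    if table is None:
        return []
    # Collapse internal whitespace so multi-line cells become one line.
    return [" ".join(th.get_text().split())
            for th in table.find_all("th", scope="row")]


singles = singles_after_heading(SAMPLE_HTML)
print(singles)
```

Collapsing whitespace is a design choice for readability: it keeps a title and its "(featuring ...)" credit on one line instead of two.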
Comments:

Thank you very much! I thought findAll and find_all did the same thing, because of this.

@user2985522 I think it depends on the version you are using. If you are using an older BeautifulSoup, it is find_all, but if you are using bs4 (which you should be), it is findAll.

@Hooked: Isn't that backwards? I'm pretty sure findAll is the old one.

@DSM, you're right. I'll update my answer accordingly.

Full output of the find_next_sibling() example:
"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"