Using BeautifulSoup to strip span elements nested inside a TD

html css parsing web-scraping


I'm new to web scraping, so I wrote a small script to pull player scores from this site:

Here is the code:

import urllib2
from bs4 import BeautifulSoup

# Fetch the page and parse it (closing parenthesis added; it was missing)
soup = BeautifulSoup(urllib2.urlopen("http://www.fold.it/portal/players").read())

for row in soup('tr', {'class':'even'}):
    rank = row('td')[0].string
    td2 = row('td')[1]
    for name in td2('a'):
        user = name.text
    score = row('td')[2].string

    # Print once per row, inside the loop
    print rank, user, score

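This works well, except that the user's name also comes back with two other scores attached. Looking at the HTML, there seem to be two span elements after the a href.

My first idea was to split "user" on whitespace, but some names contain spaces, so that doesn't work. I also thought about looking for digits, but some users have numbers in their names too.

I think getting rid of the spans is my best bet. However, I'm not sure what the best way to parse them out is. Any help would be appreciated.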

The scores are in separate span tags, so use that:

for row in soup('tr', {'class': 'even'}):
    cells = row('td')
    rank = cells[0].string

    # finding the first text node - this is our name
    name = cells[1].a.find(text=True).strip()

    # ranks are in two separate `span` tags
    rank1, rank2 = cells[1].find_all("span")

    print name, rank1.text, rank2.text

Prints:

Galaxie 1 3
smilingone 2 35
LociOiling 3 9
Desnouck Maarten 4 153
...
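
For anyone who wants to try this without hitting the live site, here is a minimal sketch of the same idea run against an inline snippet. The markup in SAMPLE_HTML is an assumption modeled on the description in the question (two span tags inside the link), and parse_rows is a hypothetical helper name; the real page may be structured differently. It uses print() and find(string=True) so that it also runs on Python 3 with a recent bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the description in the question;
# the real page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr class="even">
    <td>1</td>
    <td><a href="#">Galaxie <span>1</span> <span>3</span></a></td>
  </tr>
  <tr class="even">
    <td>4</td>
    <td><a href="#">Desnouck Maarten <span>4</span> <span>153</span></a></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return (name, rank1, rank2) tuples for every row with class "even"."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", class_="even"):
        cells = row.find_all("td")
        link = cells[1].a
        # The first text node inside the link is the name; the spans follow it.
        name = link.find(string=True).strip()
        # Assumes exactly two spans per link, as in the answer above.
        rank1, rank2 = (s.get_text(strip=True) for s in link.find_all("span"))
        results.append((name, rank1, rank2))
    return results


for name, rank1, rank2 in parse_rows(SAMPLE_HTML):
    print("{} {} {}".format(name, rank1, rank2))
```

Because only the first text node is read, names containing spaces or digits come through intact, which is what the whitespace-splitting idea could not guarantee.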

Thanks, works great!
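
The question originally asked about removing the spans outright. That is a different technique from reading the first text node, and a rough sketch of it follows, using Tag.decompose() from bs4. ROW_HTML is assumed markup with a placeholder score of 100 (not real data), and name_without_spans is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical single row; the score "100" is a placeholder, not real data.
ROW_HTML = (
    '<table><tr class="even">'
    '<td>2</td>'
    '<td><a href="#">smilingone <span>2</span> <span>35</span></a></td>'
    '<td>100</td>'
    '</tr></table>'
)


def name_without_spans(link):
    """Delete every <span> inside the link, then return the remaining text."""
    for span in link.find_all("span"):
        span.decompose()  # removes the tag and its contents from the tree
    return link.get_text().strip()


soup = BeautifulSoup(ROW_HTML, "html.parser")
row = soup.find("tr", class_="even")
cells = row.find_all("td")

rank = cells[0].get_text(strip=True)
user = name_without_spans(cells[1].a)
score = cells[2].get_text(strip=True)

print("{} {} {}".format(rank, user, score))
```

Note that decompose() modifies the parsed tree in place, so if the span values are also needed they have to be read before the spans are removed.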