Web scraping with BeautifulSoup 4 in Python 2.7


Below is the div tag taken directly from espncricinfo.com.

I want to scrape the HTML above:

from bs4 import BeautifulSoup
import urllib2

BASE_URL = "http://www.espncricinfo.com"
espn_ = urllib2.urlopen("http://www.espncricinfo.com/ci/content/player/index.html?country=6")

soup = BeautifulSoup(espn_, 'html.parser')

# print soup.prettify().encode('utf-8')
t20 = soup.find_all('div', {"id": "rectPlyr_Playerlistt20"})
for row in t20:
    print(row.find('tr', {"class": "odd"}))
Assume I fetch the page from the URL given above. When I scrape it, the output I get is None.


Even when I print t20 I don't get the full output; it only goes as far as JJ Bumrah, i.e. only the first
tag. If the data above is unclear, visit the URL provided in the code, select Team India and go to the T20 tab. I want to scrape the href links of all the players listed under the T20 tab.

The HTML is badly broken, as you can see just by looking at the first few rows of the table. Your best bet is to use lxml or html5lib as the parser, find the anchors directly, and slice with a step:

soup = BeautifulSoup(espn_.read(), 'html5lib')

t20 = soup.select("#rectPlyr_Playerlistt20 .playersTable td.divider a")
for a in t20[1::2]:
    print(a)
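The `[1::2]` step slice keeps only every second anchor, on the assumption that each player cell contains two links (a flag/image link followed by the player-name link); starting at index 1 skips the first of each pair. A minimal sketch with plain strings standing in for the tags:

```python
# Hypothetical stand-ins for the anchors in each td.divider cell:
# each player contributes a flag link followed by a name link.
anchors = ["flag-1", "STR Binny", "flag-2", "R Dhawan", "flag-3", "FY Fazal"]

names = anchors[1::2]  # start at index 1, step by 2 -> every second element
print(names)  # ['STR Binny', 'R Dhawan', 'FY Fazal']
```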
which gives you:

<a href="/ci/content/player/27223.html">STR Binny</a>
<a href="/ci/content/player/290727.html">R Dhawan</a>
<a href="/ci/content/player/28671.html">FY Fazal</a>
<a href="/ci/content/player/290716.html">KM Jadhav</a>
<a href="/ci/content/player/326016.html">B Kumar</a>
<a href="/ci/content/player/481896.html">Mohammed Shami</a>
<a href="/ci/content/player/32540.html">CA Pujara</a>
<a href="/ci/content/player/33141.html">AT Rayudu</a>
<a href="/ci/content/player/34102.html">RG Sharma</a>
<a href="/ci/content/player/237095.html">M Vijay</a>