Web scraping with BeautifulSoup 4 in Python 2.7


Below is the div tag taken directly from espncricinfo.com.

I want to scrape the HTML above:

from bs4 import BeautifulSoup
import urllib2

BASE_URL = "http://www.espncricinfo.com"
espn_ = urllib2.urlopen("http://www.espncricinfo.com/ci/content/player/index.html?country=6")

soup = BeautifulSoup(espn_, 'html.parser')

# print soup.prettify().encode('utf-8')
t20 = soup.find_all('div', {"id": "rectPlyr_Playerlistt20"})
for row in t20:
    print(row.find('tr', {"class": "odd"}))
Assume I fetch the page from the URL given above. When I scrape it, the output I get is None.


Even when I print t20 I don't get the full output; it only goes as far as JJ Bumrah, i.e. only the first
tag. If the data above is unclear, visit the URL provided in the code, select Team India and go to the T20 tab. I want to scrape the href links of all the players listed under the T20 tab.

The HTML is badly broken, as you can see just by looking at the first few rows of the table. Your best bet is to use lxml or html5lib as the parser, find the anchors directly, and slice with a step:

soup = BeautifulSoup(espn_.read(), 'html5lib')

t20 = soup.select("#rectPlyr_Playerlistt20 .playersTable td.divider a")
for a in t20[1::2]:
    print(a)
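The `[1::2]` step slice keeps only every second anchor, on the assumption that each player cell contains two links (a flag/image link followed by the player-name link); starting at index 1 skips the first of each pair. A minimal sketch with plain strings standing in for the tags:

```python
# Hypothetical stand-ins for the anchors in each td.divider cell:
# each player contributes a flag link followed by a name link.
anchors = ["flag-1", "STR Binny", "flag-2", "R Dhawan", "flag-3", "FY Fazal"]

names = anchors[1::2]  # start at index 1, step by 2 -> every second element
print(names)  # ['STR Binny', 'R Dhawan', 'FY Fazal']
```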
which gives you:

<a href="/ci/content/player/27223.html">STR Binny</a>
<a href="/ci/content/player/290727.html">R Dhawan</a>
<a href="/ci/content/player/28671.html">FY Fazal</a>
<a href="/ci/content/player/290716.html">KM Jadhav</a>
<a href="/ci/content/player/326016.html">B Kumar</a>
<a href="/ci/content/player/481896.html">Mohammed Shami</a>
<a href="/ci/content/player/32540.html">CA Pujara</a>
<a href="/ci/content/player/33141.html">AT Rayudu</a>
<a href="/ci/content/player/34102.html">RG Sharma</a>
<a href="/ci/content/player/237095.html">M Vijay</a>