Python XPath() method not returning results with Scrapy
I am using the Python.org 64-bit build of Python 2.7 on 64-bit Windows Vista. I have some Scrapy code that tries to parse the table on the page titled "Wayne Rooney's match history" (the page listed in start_urls). The code I have so far is as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import re

class MySpider(Spider):
    name = "wiki"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        for row in response.selector.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
            # Does this row contain goal symbols?
            list_of_goals = row.xpath('//span[@title="Goal"]')
            if list_of_goals:
                list = str(list_of_goals)
                print remove_tags(list).encode('utf-8')

execute(['scrapy','crawl','wiki'])
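This returns all the data in the table except the goal data (it does not return assists either, but I have not added any logic for that yet). This code is a development of my original code, which returned neither goals nor assists: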
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import re

class MySpider(Spider):
    name = "goal"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        titles = response.selector.xpath("normalize-space(//title)")
        for titles in titles:
            body = response.xpath('//table[@id="player-fixture"]//tr[td[@class="tournament"]]').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy','crawl','goal'])
The markup in the HTML source that indicates a goal is as follows:
<span class="incident-wrapper"><span class="incidents-icon ui-icon goal" title="Goal"></span></span>
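Can anyone tell me why the code I listed at the top does not return the goals scored with this logic? Is it related to the fact that a ball icon, rather than a word, is used to denote a goal?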
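Thanks

In the first version you only get the empty span elements, which contain no text, so the result is an empty string: remove_tags() strips the markup and nothing is left.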
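To make that concrete, here is a minimal sketch that runs without a crawl. It uses the goal markup quoted in the question, a hypothetical strip_tags() helper as a rough stand-in for remove_tags(), and the standard library's xml.etree.ElementTree in place of Scrapy's Selector, so treat it as an illustration rather than what Scrapy does internally.

```python
import re
import xml.etree.ElementTree as ET

# The goal marker as quoted from the page source in the question.
GOAL_HTML = (
    '<span class="incident-wrapper">'
    '<span class="incidents-icon ui-icon goal" title="Goal"></span>'
    '</span>'
)


def strip_tags(markup):
    """Hypothetical stand-in for remove_tags(): drop every <...> tag."""
    return re.sub(r"<[^>]+>", "", markup)


# Stripping the tags leaves nothing, because the icon has no text node.
text_only = strip_tags(GOAL_HTML)

# The information lives in the title attribute of the inner span.
wrapper = ET.fromstring(GOAL_HTML)
titles = [span.get("title") for span in wrapper.iter("span") if span.get("title")]

print(repr(text_only))
print(titles)
```

So the text is gone after stripping, while the title attribute still carries the word you are after.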
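You can add the string "Goal" yourself for rows that have the goal icon (the snippet and the full spider are listed after the comments below).
Part of the result: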
titles: Wayne Rooney Match History | WhoScored.com
date: 17-08-2013
result: 1 : 4
team_home: Swansea
team_away: Manchester United
info: 28' Minutes played in this match
rating: 7.26
incidents: Assist, Assist
----------------------------------------
date: 26-08-2013
result: 0 : 0
team_home: Manchester United
team_away: Chelsea
info: 90' Minutes played in this match
rating: 7.03
incidents:
----------------------------------------
date: 14-09-2013
result: 2 : 0
team_home: Manchester United
team_away: Crystal Palace
info: 90' Minutes played in this match
rating: 8.44
incidents: Man of the Match, Goal
----------------------------------------
date: 17-09-2013
result: 4 : 2
team_home: Manchester United
team_away: Bayer Leverkusen
info: 84' Minutes played in this match
rating: 9.18
incidents: Goal, Goal, Assist
----------------------------------------
date: 22-09-2013
result: 4 : 1
team_home: Manchester City
team_away: Manchester United
info: 90' Minutes played in this match
rating: 7.17
incidents: Goal, Yellow Card
----------------------------------------
date: 25-09-2013
result: 1 : 0
team_home: Manchester United
team_away: Liverpool
info: 90' Minutes played in this match
rating:
incidents: Man of the Match, Assist
----------------------------------------
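I have tried printing the string without removing the tags, and I have also tried printing just list_of_goals. Printed in Python IDLE, each row comes back as one result per line, like this: "EPL, 14-09-2013, Manchester United, 2 : 0, Crystal Palace, 90', 8.44" (I added the commas)... All the data represented by icons rather than plain text is still not returned. I also tried the following logic, which still does not print the goal data to the screen: list_of_goals = row.xpath('//span[@class="incident-wrapper"]//span[@class="incidents-icon ui-icon goal"]//span[@title="Goal"]')

The data represented by icons is an empty string, so you cannot get any text from those span elements. All the text is outside them, so do not rely on their text. Instead, you can add your own string "Goal" to the result when you find the goal icon.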
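As a side note on the three-step expression tried in the comment: in the markup quoted in the question, title="Goal" sits on the same span as the icon classes, so a further //span step looks for a span nested inside the icon, and there is none. The sketch below shows the difference on a small fragment. The surrounding tr/td is an assumption made for illustration, and xml.etree.ElementTree stands in for Scrapy's Selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-row fragment built around the markup quoted in the question.
ROW = ET.fromstring(
    '<tr><td>'
    '<span class="incident-wrapper">'
    '<span class="incidents-icon ui-icon goal" title="Goal"></span>'
    '</span>'
    '</td></tr>'
)

# Three chained steps: the last one searches *inside* the icon span.
nested = ROW.findall(
    ".//span[@class='incident-wrapper']"
    "//span[@class='incidents-icon ui-icon goal']"
    "//span[@title='Goal']"
)

# Two steps, with both conditions placed on the icon span itself.
same_element = ROW.findall(
    ".//span[@class='incident-wrapper']"
    "//span[@class='incidents-icon ui-icon goal'][@title='Goal']"
)

print(len(nested))
print(len(same_element))
```

The chained version matches nothing, while putting both attribute tests on one element finds the icon.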
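# The leading "." keeps the search inside the current row;
# a path starting with "//" would search the whole document.
list_of_goals = row.xpath('.//span[@title="Goal"]')
if list_of_goals:
    print "GOAL"  # <-- string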
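One thing to watch with a per-row check like this: in Scrapy (and lxml), an XPath that starts with // is evaluated against the whole document even when it is called on a row selector, so the test should be scoped to the row with a leading dot. The sketch below illustrates the difference on a made-up two-row table. ElementTree does not accept absolute paths on elements, so the document-wide case is emulated by searching from the table root, and the "incidents" cell class is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table: one match without incidents, one with a goal.
TABLE = ET.fromstring("""
<table id="player-fixture">
  <tr><td class="date">26-08-2013</td><td class="incidents"></td></tr>
  <tr><td class="date">14-09-2013</td>
      <td class="incidents"><span class="incidents-icon ui-icon goal" title="Goal"></span></td></tr>
</table>
""")

GOAL = ".//span[@title='Goal']"

# Searching from the table root sees every goal icon in the document.
document_wide = len(TABLE.findall(GOAL))

# Searching from each row sees only that row's icons.
per_row = {}
for row in TABLE.findall("tr"):
    date = row.findtext("td[@class='date']")
    per_row[date] = len(row.findall(GOAL))

print(document_wide)
print(per_row)
```

With a document-wide search, every row would look as if it contained a goal as soon as the page has one goal icon anywhere.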
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.cmdline import execute

class MySpider(Spider):
    name = "goal"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

    def parse(self, response):
        sel = Selector(response)

        #titles = sel.xpath("normalize-space(//title)")
        #print 'titles:', titles.extract()[0]
        print
        print 'titles:', "".join( sel.css("title::text").extract() ).strip()
        print

        #rows = sel.xpath('//table[@id="player-fixture"]//tbody//tr')
        rows = sel.css('table#player-fixture tbody tr')

        for row in rows:
            #print 'date:', row.xpath('.//td[@class="date"]/text()').extract()
            #print 'result:', row.xpath('.//td[@class="result"]/a/text()').extract()
            print 'date:', "".join( row.css('.date::text').extract() ).strip()
            print 'result:', "".join( row.css('.result a::text').extract() ).strip()
            print 'team_home:', "".join( row.css('.team.home a::text').extract() ).strip()
            print 'team_away:', "".join( row.css('.team.away a::text').extract() ).strip()
            print 'info:', "".join( row.css('.info::text').extract() ).strip(), "".join( row.css('.info::attr(title)').extract() ).strip()
            print 'rating:', "".join( row.css('.rating::text').extract() ).strip()
            print 'incidents:', ", ".join( row.css('.incidents-icon::attr(title)').extract() ).strip()
            print '-'*40

#execute(['scrapy','crawl','goal'])
execute(['scrapy','runspider','main.py'])
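The spider joins each extract() list instead of indexing it with [0]. One likely benefit is that a cell with no text (such as the empty rating in the 25-09-2013 row of the output) produces an empty string rather than an error. A small sketch of that idiom follows. The joined_text() helper and the sample lists are hypothetical stand-ins for what extract() might return.

```python
def joined_text(parts):
    """Join the text fragments returned for one cell and trim whitespace."""
    return "".join(parts).strip()


# Hypothetical extract() results: a filled rating cell and an empty one.
filled = ["\n  8.44  "]
empty = []

print(repr(joined_text(filled)))
print(repr(joined_text(empty)))

# Indexing the empty list, by contrast, raises IndexError.
try:
    empty[0]
except IndexError:
    indexing_failed = True
else:
    indexing_failed = False

print(indexing_failed)
```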