
Python XPath with iterator (Scrapy) not pointing to the correct HTML table elements


I'm having trouble selecting HTML elements from a table with Scrapy using XPath. I'm working from the very basic example on the Scrapy website; the site I want to parse is the Euroleague game page listed in start_urls below.

First, I used the following code:

from basketbase.items import BasketbaseItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse


class Basketspider(CrawlSpider):
    name = "playbyplay"
    download_delay = 0.5

    allowed_domains = ["www.euroleague.net"]
    start_urls = ["http://www.euroleague.net/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2013"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(),),callback='parse_item',),        
    )  


    def parse(self,response):
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        return super(Basketspider,self).parse(response)

    def parse_item(self, response):
        response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
        sel = HtmlXPathSelector(response)

        items=[]
        item = BasketbaseItem()         
        item['game_time'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[1]/text()').extract() #
        item['game_event'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[2]/text()').extract() #
        item['game_event_res_home'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[3]/text()').extract() #
        item['game_event_res_visitor'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[3]/text()').extract() #
        item['game_event_team'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[4]/text()').extract() #
        item['game_event_player'] = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr/td[5]/text()').extract() #          
        items.append(item)



        return items
OK, this is basic and the rules aren't quite right yet, but this example is mainly about the XPath.
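For reference, a minimal basketbase/items.py matching the fields used above could look like the sketch below; the actual item definition is not shown in the question, so this is only an assumption based on the field names.

from scrapy.item import Item, Field

class BasketbaseItem(Item):
    # One play-by-play row from the game page
    game_time = Field()
    game_event = Field()
    game_event_res_home = Field()
    game_event_res_visitor = Field()
    game_event_team = Field()
    game_event_player = Field()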

This works, but not the way I want it to. I want each item to take only one td value per tr, but with this code all the td elements are extracted into the item at once. Item game_event_res_visitor:

'game_event_res_visitor': [u'0-0',
                           u'0-0',
                           u'0-0',.......(list goes on and on)
To get the result I want, I decided to use a loop like the one in the Scrapy tutorial, but it doesn't return any values at all. Here is the code:

def parse(self,response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    return super(Basketspider,self).parse(response)

def parse_item(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    sel = HtmlXPathSelector(response)
    sites = sel.xpath('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')        
    items=[]
    item = BasketbaseItem()
    for site in sites:

        item = BasketbaseItem()
        item['game_time'] = sel.select('td[1]/text()').extract() #
        item['game_event'] = sel.select('td[2]/text()').extract() #
        item['game_event_res_home'] = sel.select('td[3]/text()').extract() #
        item['game_event_res_visitor'] = sel.select('td[3]/text()').extract() #
        item['game_event_team'] = sel.select('td[4]/text()').extract() #
        item['game_event_player'] = sel.select('td[5]/text()').extract() #          
        items.append(item)



    return items
And the terminal output:

2014-03-07 16:57:45+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=9&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
    {'game_event': [],
     'game_event_player': [],
     'game_event_res_home': [],
     'game_event_res_visitor': [],
     'game_event_team': [],
     'game_time': []}
2014-03-07 16:57:45+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=9&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
    {'game_event': [],
     'game_event_player': [],
     'game_event_res_home': [],
     'game_event_res_visitor': [],
     'game_event_team': [],
     'game_time': []}
It fails to get any text results; at most I get whitespace like this:

2014-03-07 19:11:14+0200 [playbyplay] DEBUG: Scraped from <200 http://www.euroleague.net/main/results/showgame?gamecode=7&gamenumber=1&phasetypecode=RS&seasoncode=E2013>
    {'game_event': [u' \r\n', u'\r\n'],
     'game_event_player': [u' \r\n', u'\r\n'],
     'game_event_res_home': [u' \r\n', u'\r\n'],
     'game_event_res_visitor': [u' \r\n', u'\r\n'],
     'game_event_team': [u' \r\n', u'\r\n'],
     'game_time': [u' \r\n', u'\r\n']}

I'm confused and don't understand what is wrong with my XPath or my code.

Here is what worked for me:

def parse_item(self, response):
    response = HtmlResponse(url=response.url, status=response.status, headers=response.headers, body=response.body)
    sel = HtmlXPathSelector(response)

    rows = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')
    for row in rows:
        item = BasketbaseItem()
        item['game_time'] = row.select("td[1]/text()").extract()[0]
        item['game_event'] = row.select("td[2]/text()").extract()[0]
        result = row.select("td[3]/text()").extract()[0]
        item['game_event_res_home'], item['game_event_res_visitor'] = result.split('-')
        item['game_event_team'] = row.select("td[4]/text()").extract()[0]
        item['game_event_player'] = row.select("td[5]/text()").extract()[0]
        yield item
Here is an example of what I get:

{'game_event': u'Steal',
 'game_event_player': u'DJEDOVIC, NIHAD',
 'game_event_res_home': u'0 ',
 'game_event_res_visitor': u' 0',
 'game_event_team': u'FC Bayern Munich',
 'game_time': u'2'}
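The key difference from the looping attempt in the question is that the relative XPath is applied to each row selector rather than to the page-level sel, so td[1]/text() is resolved inside the current tr. A minimal contrast, reusing the same selectors as above:

rows = sel.select('//div[@style="overflow: auto; height: 250px; width: 800px;"]/table/tbody/tr')
for row in rows:
    # Relative to the page-level selector: 'td[1]' is looked up from the
    # document root, where no <td> child exists, so nothing is extracted.
    sel.select('td[1]/text()').extract()
    # Relative to the current row: this extracts the first cell of this <tr>.
    row.select('td[1]/text()').extract()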
But for you this is just a starting point: sometimes an item cannot be produced because of an IndexError, so handle it properly.
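For example, a small guard around the extraction could look like the sketch below; first_text is a hypothetical helper, not part of Scrapy, and it simply falls back to a default when a cell is missing:

def first_text(selector, xpath, default=u''):
    # extract() returns a list of strings; the list is empty when the cell
    # is missing (e.g. on the scoring-event marker rows), which is what
    # raises the IndexError on [0].
    values = selector.select(xpath).extract()
    return values[0].strip() if values else default

# Inside parse_item(), per row:
for row in rows:
    item = BasketbaseItem()
    item['game_time'] = first_text(row, 'td[1]/text()')
    item['game_event'] = first_text(row, 'td[2]/text()')
    result = first_text(row, 'td[3]/text()', default='-')
    item['game_event_res_home'], item['game_event_res_visitor'] = result.split('-', 1)
    item['game_event_team'] = first_text(row, 'td[4]/text()')
    item['game_event_player'] = first_text(row, 'td[5]/text()')
    yield item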


Hope this helps.

For tips on writing better questions: isolating the problem down to the minimal code needed to reproduce it helps a lot.

Yes, it raises an IndexError; that is caused by the elements used to mark scoring events during the game. But the spider works now.