
Python: how to get text from the XPath above the selected XPath?


I want to use Scrapy to extract the numbers from each table row:

     <tr>  
        <td class="legend left value">1</td>
        <td colspan="4" class="legend title">Corners</td>
        <td class="legend right value">5</td>
      </tr>
      <tr>  
        <td class="legend left value">2</td>
        <td colspan="4" class="legend title">Shots on target</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">3</td>
        <td colspan="4" class="legend title">Shots wide</td>
        <td class="legend right value">8</td>
      </tr>
      <tr>  
        <td class="legend left value">14</td>
        <td colspan="4" class="legend title">Fouls</td>
        <td class="legend right value">14</td>
      </tr>
      <tr>  
        <td class="legend left value">2</td>
        <td colspan="4" class="legend title">Offsides</td>
        <td class="legend right value">4</td>
      </tr>

Does anyone know what I'm doing wrong?

Here is an example scrapy shell session walking through the different stages:

  • Fetch the start page
  • Grab the iframe that contains the statistics you are interested in and get its src attribute
  • Fetch the content of that iframe (this needs another request; in the shell just use fetch())
  • Find the table holding the data and pick only the rows at even positions
  • In each row, the cells at odd positions (1 and 3) hold the numbers and the 2nd cell holds the name of the statistic
  • Here is how it goes:

    scrapy shell "http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01"
    2014-08-21 11:06:19+0200 [scrapy] INFO: Scrapy 0.24.2 started (bot: scrapybot)
    2014-08-21 11:06:19+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
    2014-08-21 11:06:19+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-08-21 11:06:19+0200 [scrapy] INFO: Enabled item pipelines: 
    2014-08-21 11:06:19+0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2014-08-21 11:06:19+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2014-08-21 11:06:19+0200 [default] INFO: Spider opened
    2014-08-21 11:06:19+0200 [default] DEBUG: Crawled (200) <GET http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fcfe7bda550>
    [s]   item       {}
    [s]   request    <GET http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01>
    [s]   response   <200 http://int.soccerway.com/matches/2014/08/08/france/ligue-1/stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01>
    [s]   settings   <scrapy.settings.Settings object at 0x7fcfe8299ad0>
    [s]   spider     <Spider 'default' at 0x7fcfe7386b10>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    
    In [1]: import urlparse
    
    In [2]: iframe_src = response.css('div.block_match_stats_plus_chart > iframe::attr(src)').extract()[0]
    
    In [3]: fetch(urlparse.urljoin(response.url, iframe_src))
    2014-08-21 11:06:35+0200 [default] DEBUG: Crawled (200) <GET http://int.soccerway.com/charts/statsplus/1686679/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fcfe7bda550>
    [s]   item       {}
    [s]   request    <GET http://int.soccerway.com/charts/statsplus/1686679/>
    [s]   response   <200 http://int.soccerway.com/charts/statsplus/1686679/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fcfe8299ad0>
    [s]   spider     <Spider 'default' at 0x7fcfe7386b10>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    
    In [4]: stats = {}
    
    In [5]: for row in response.css('div.chart > table > tr:nth-child(even)'):
       ...:     name = row.css('td:nth-child(even)::text').extract()[0]
       ...:     stats[name] = map(int, row.css('td:nth-child(odd)::text').extract())
       ...: 
    
    In [6]: stats
    Out[6]: 
    {u'Corners': [1, 5],
     u'Fouls': [14, 14],
     u'Offsides': [2, 4],
     u'Shots on target': [2, 8],
     u'Shots wide': [3, 8]}
    
    In [7]: 
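
For reference, the same steps can be wired into a standalone spider. This is only a sketch, not the answer's own code: it assumes a recent Scrapy release (1.4 or later, where response.follow(), .get() and .getall() are available) and Python 3, and it reuses the CSS selectors and URL from the shell session above:

    import scrapy

    class MatchStatsSpider(scrapy.Spider):
        name = "match_stats"
        start_urls = [
            "http://int.soccerway.com/matches/2014/08/08/france/ligue-1/"
            "stade-de-reims/paris-saint-germain-fc/1686679/?ICID=PL_MS_01",
        ]

        def parse(self, response):
            # Steps 1-2: locate the stats iframe and follow its src attribute;
            # response.follow() resolves the relative URL for us
            iframe_src = response.css(
                'div.block_match_stats_plus_chart > iframe::attr(src)').get()
            if iframe_src:
                yield response.follow(iframe_src, callback=self.parse_stats)

        def parse_stats(self, response):
            # Steps 4-5: keep only the rows at even positions; in each row the
            # 2nd cell is the stat name, the 1st and 3rd cells are the numbers
            stats = {}
            for row in response.css('div.chart > table > tr:nth-child(even)'):
                name = row.css('td:nth-child(even)::text').get()
                stats[name] = [int(v) for v in
                               row.css('td:nth-child(odd)::text').getall()]
            yield stats

Running it with scrapy runspider match_stats.py -o stats.json should yield one dictionary per match, shaped like the Out[6] result above.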
    
You can try this XPath query; I used this HTML:

    <table> 
          <tr>  
            <td class="legend left value">1</td>
            <td colspan="4" class="legend title">Corners</td>
            <td class="legend right value">5</td>
          </tr>
          <tr>  
            <td class="legend left value">2</td>
            <td colspan="4" class="legend title">Shots on target</td>
            <td class="legend right value">8</td>
          </tr>
          <tr>  
            <td class="legend left value">3</td>
            <td colspan="4" class="legend title">Shots wide</td>
            <td class="legend right value">8</td>
              </tr>
          <tr>  
            <td class="legend left value">1</td>
            <td colspan="4" class="legend title">Corners</td>
            <td class="legend right value">8</td>
          </tr>
          <tr>  
            <td class="legend left value">14</td>
            <td colspan="4" class="legend title">Fouls</td>
            <td class="legend right value">14</td>
          </tr>
          <tr>  
            <td class="legend left value">2</td>
            <td colspan="4" class="legend title">Offsides</td>
            <td class="legend right value">4</td>
          </tr>
          <tr>  
            <td class="legend left value">1</td>
            <td colspan="4" class="legend title">Corners</td>
            <td class="legend right value">3</td>
          </tr>
    </table>
    
Result:

    5
    8
    3
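
The exact query used in this answer is not quoted above as scraped. As a rough check, the following-sibling expression quoted at the bottom of the page reproduces these numbers with lxml, provided "Corners" is capitalized to match the markup (XPath contains() is case-sensitive). A minimal sketch:

    from lxml import html

    # Trimmed copy of the table above, keeping only the three "Corners" rows
    doc = html.fromstring("""
    <table>
      <tr><td class="legend left value">1</td>
          <td colspan="4" class="legend title">Corners</td>
          <td class="legend right value">5</td></tr>
      <tr><td class="legend left value">1</td>
          <td colspan="4" class="legend title">Corners</td>
          <td class="legend right value">8</td></tr>
      <tr><td class="legend left value">1</td>
          <td colspan="4" class="legend title">Corners</td>
          <td class="legend right value">3</td></tr>
    </table>
    """)

    print(doc.xpath('//td[@class="legend title" and contains(text(), "Corners")]'
                    '/following-sibling::td[1]/text()'))
    # ['5', '8', '3']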
    

You would need to point us at the exact document you are working with: for example, if it uses XML namespaces, you have to make the query namespace-aware, otherwise it will never find anything. (Does your document have an xmlns= near the top?) Also, there are no tbody elements between the table and tr elements here; browsers add tbody when they tidy up the markup, but that then breaks XPath expressions written against the real HTML source, so I would suggest not using tbody in your XPath (as in the example session in my answer above):
    //td[@class="legend title" and contains(text(), "corner")]/following-sibling::td[1]/text()
    
    5
    8
    3
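
To illustrate the namespace point: if the document declared a default namespace (the XHTML namespace is used below purely as an example), the plain query would stop matching, and you would have to bind a prefix to that namespace and use it on every element name. A sketch with lxml:

    from lxml import etree

    # The same kind of markup, but inside a document with a default namespace
    doc = etree.fromstring(
        '<table xmlns="http://www.w3.org/1999/xhtml">'
        '<tr><td class="legend title">Corners</td>'
        '<td class="legend right value">5</td></tr></table>')

    # Without a namespace prefix the query finds nothing
    print(doc.xpath('//td[@class="legend title"]/following-sibling::td[1]/text()'))
    # []

    # Binding a prefix and using it on every element name makes it work again
    ns = {'h': 'http://www.w3.org/1999/xhtml'}
    print(doc.xpath('//h:td[@class="legend title"]/following-sibling::h:td[1]/text()',
                    namespaces=ns))
    # ['5']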