
Scrapy XPath not returning the desired results. Any ideas?


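Please take a look at this page. As you may have guessed, I am trying to scrape all the fields on it. Every field is extracted correctly except the answer field. What I find odd is that the page structure for the question and the answer is almost identical (table[1] and table[2]); the question gets scraped, but the answer does not. Here are my XPaths:

Question: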

['q_main'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[1]/tbody/tr/td/text()').extract()
Works perfectly.

Answer:

['q_answer'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[2]/tbody/tr[2]/td/text()').extract()
Returns nothing. I copied the full XPath, and both XPath Helper and the browser console return and verify it.
What am I overlooking? What am I not seeing?

Your XPath seems to be the problem.

Here is a demo from the scrapy shell:

In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract()
Out[1]: 
[u'\r\n\r\n',
 u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY   (SHRI PIYUSH GOYAL)\r\n\r\n ',
 u'(a) & (b): So far 29 coal mines have been auctioned under the provisions of Coal Mines (Special Provisions) \r\nAct, 2015 and the Rules made thereunder. The auction process for non-regulated sector viz. Iron and Steel, \r\nCement and Captive Power was based on forward bidding process where bidders had to submit their final price \r\noffer above the applicable floor price. In case of Power sector which is a regulated one, reverse bidding \r\nmethodology was adopted where bidders had to submit bids below the applicable ceiling price, which shall be \r\ntaken as fuel cost in determination of power tariff. In case, bid price reaches Rs. zero in reverse bidding, \r\nthe bidding is based on additional premium payable to the concerned State Government, over and  above  the  \r\nfixed  reserve  price  of  Rs. 100/-  per  tonne.\r\n\r\n',
 u'\r\nRevenue which would accrue to the coal bearing State Government concerned comprises of Upfront payment \r\nas prescribed in the tender document, Auction proceeds and Royalty on per tonne of coal production. State-wise \r\ndetails of 29 coal mines auctioned so far along-with specified end-uses and estimated revenue which would accrue \r\nto coal bearing state during the life of mine/lease period as given below:\r\n',
 u'\r\n\r\nS.No\tState\t\tSpecified End \u2013Use\t\t\tName of Coal Mine\t\tEstimated Revenueduring \r\n\t\t\t\t\t\t\t\t\t\t\t\tthe life of mine/lease \r\n\t\t\t\t\t\t\t\t\t\t\t\tperiod (Rs. In Crores)\r\n1\tChattishgarh\tNon-Regualted Sector\t\t\tChotia\t\t\t\t51596\r\n\t\t\t\t\t\t\t\tGare Palma IV-4\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-5\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-7\t\r\n\t\t\t\t\t\t\t\tGare-Palma Sector-IV/8\r\n2\tJharkhand\tNon-Regualted Sector\t\t\tBrinda and Sasai\t\t49272\r\n\t\t\t\t\t\t\t\tDumri\r\n\t\t\t\t\t\t\t\tKathautia\r\n\t\t\t\t\t\t\t\tLohari\r\n\t\t\t\t\t\t\t\tMeral\r\n\t\t\t\t\t\t\t\tMoitra\r\n\t\t\tPower\t\t\t\t\tGaneshpur\r\n\t\t\t\t\t\t\t\tJitpur\r\n\t\t\t\t\t\t\t\tTokisud North\r\n3\tMadhya Pradesh\tNon-Regualted Sector\t\t\tBicharpur\t\t\t42811\r\n\t\t\t\t\t\t\t\tMandla North\r\n\t\t\t\t\t\t\t\tMandla-South\r\n\t\t\t\t\t\t\t\tSialGhoghri\r\n\t\t\tPower\t\t\t\t\tAmelia North\r\n4\tMaharashtra\tNon-Regualted Sector\t\t\tBelgaon\t\t\t\t2738\r\n\t\t\t\t\t\t\t\tMarkiMangli III\r\n\t\t\t\t\t\t\t\tNerad Malegaon\r\n5\tOdisha\t\tPower\t\t\t\t\tMandakini\t\t\t33741\r\n\t\t\t\t\t\t\t\tTalabira-I\r\n\t\t\t\t\t\t\t\tUtkal - C\r\n6\tWest Bengal\tNon-Regualted Sector\t\t\tArdhagram\t\t\t13354\r\n\t\t\tPower\t\t\t\t\tSarisatolli\r\n\t\t\t\t\t\t\t\tTrans Damodar\r\n\tTotal\t\t\t\t\t\t\t(29) coal blocks\t\t193512\r\n',
 u'\r\n\r\n\r\nCoal mine has been assigned to successful bidder as Designated Custodian in view of a court case.\r\n\r\n',
 u'\r\nIn addition, an estimated amount of Rs. 1,41,854 Crores would accrue to coal bearing States from allotment \r\nof 38 coal mines to Central and State PSU\u2019s.\r\n\r\n',
 u'Out of these 29 coal mines, 16 are operational coal mines included in Schedule-II of the Act and 13 are \r\nnon-operational included in Schedule-III of the Act. Milestones for development and production of coal \r\nfrom the auctioned coal mines have been prescribed under the Coal Mines Development and Production Agreement \r\nsigned with the Successful Bidder. \r\n\r\n ',
 u'(c) & (d): Yes, Sir. A few complaints were received regarding cartelization in bidding. It is not possible to \r\nconclusively establish the same until investigation are carried out by Competent Authority. ',
 u'\r\n\r\n\r\nThe Government has not approved the recommendation of NA for declaration of successful bidder in case of \r\n4 coal mines namely Gare Palma IV/2&3, Gare Palma IV/1 and Tara as final closing bid price was not found \r\nto be reflecting fair value.  ',
 u'\r\n\r\n\r\n']

This sometimes happens when you are working with tables.
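
The list returned in the shell demo above contains whitespace-only strings and embedded line breaks. A minimal sketch of tidying such output follows; `clean_fragments` is a hypothetical helper (not part of Scrapy), and the sample strings are shortened from the output above.

```python
def clean_fragments(fragments):
    """Collapse runs of whitespace in each fragment and drop empty ones."""
    cleaned = []
    for fragment in fragments:
        # str.split() with no arguments splits on any whitespace run,
        # including \r\n and tabs, so joining with a space normalizes it.
        text = " ".join(fragment.split())
        if text:
            cleaned.append(text)
    return cleaned


# Shortened sample taken from the shell output above.
sample = [
    "\r\n\r\n",
    "(c) & (d): Yes, Sir. A few complaints were received \r\nregarding cartelization in bidding. ",
    "\r\n\r\n\r\n",
]

print(clean_fragments(sample))
```

Note that collapsing whitespace this way would also flatten the tab-aligned table in the answer, so it may be better to skip this step for fragments whose layout matters.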

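At least part of the difficulty stems from the fact that the code you see in the browser console is not the source HTML that the spider receives as its response (and that the selectors operate on). In particular, it is very common for HTML source to omit <tbody> tags in tables; however, when a browser converts the HTML into a DOM tree, it inserts the <tbody> tags. There was a time when most of a web page's layout was actually done with (insanely) nested tables, so the DOM of such a site usually contains many more <tbody> elements than the HTML source does.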
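
To illustrate the mismatch, here is a small sketch that uses only the standard library. It assumes a made-up page source (`RAW_SOURCE`, with a hypothetical `id="grid"`) that has no <tbody> tags, and uses `xml.etree.ElementTree` as a stand-in for a parser that, unlike a browser, does not insert them. It is not the Scrapy API itself, only a demonstration of why a path copied from the developer tools can come back empty.

```python
import xml.etree.ElementTree as ET

# Hypothetical page source as a server might send it: no <tbody> tags.
RAW_SOURCE = """
<html><body>
  <table id="grid">
    <tr><td class="griditemq">question text</td></tr>
    <tr><td class="griditemq">answer text</td></tr>
  </table>
</body></html>
"""


def cell_texts(markup, path):
    """Return the text of every element matched by `path`."""
    root = ET.fromstring(markup)
    return [element.text for element in root.findall(path)]


# Path in the style copied from browser developer tools (includes tbody).
with_tbody = cell_texts(RAW_SOURCE, ".//table[@id='grid']/tbody/tr/td")

# The same path with the browser-inserted tbody step removed.
without_tbody = cell_texts(RAW_SOURCE, ".//table[@id='grid']/tr/td")

print(with_tbody)     # no match: the source has no tbody element
print(without_tbody)  # both cells are found
```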

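In practice, this means:

  • It is usually better to find a relatively simple XPath (or CSS selector, or ...) for the elements you want to select than to use the behemoth copied from the developer tools.
  • Including /tbody in an XPath is usually not a good idea (unless the tag carries attributes, which suggests that it is present in the source HTML).
  • For the site in question,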

     response.xpath('//td[@class="griditemq"]').extract()
    

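    returns a list whose first element is the question and whose second element is the answer.

Finding a good XPath expression is definitely something of an art, sometimes depends on luck, and usually takes some trial and error. A good starting point is usually your Inspect Element tool: poke around in the area of the content you want to extract, look for features that make it distinctive (usually class or id attributes), and then try them out.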
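
As a rough sketch of selecting by a distinctive attribute rather than by position, the example below runs against a simplified, made-up page (`PAGE`) with nested layout tables. The class names `mainheaderq` and `griditemq` come from the discussion above; the cell contents and the `id="outer"` attribute are invented for illustration, and `xml.etree.ElementTree` again stands in for the real selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with nested layout tables.
PAGE = """
<html><body>
  <table id="outer"><tr><td>
    <table>
      <tr><td class="mainheaderq">QUESTION</td></tr>
      <tr><td class="griditemq">What was auctioned?</td></tr>
    </table>
    <table>
      <tr><td class="mainheaderq">ANSWER</td></tr>
      <tr><td class="griditemq">Coal mines.</td></tr>
    </table>
  </td></tr></table>
</body></html>
"""

root = ET.fromstring(PAGE)

# One short attribute-based path instead of a long positional one.
# Matches come back in document order: question first, then answer.
cells = [td.text for td in root.findall(".//td[@class='griditemq']")]
question, answer = cells

print(question)
print(answer)
```

Because the path depends only on the class attribute, it keeps working whether or not <tbody> elements are present and however deeply the tables are nested.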