Python Scrapy web crawler returns an "Invalid XPath" error
I am new to Scrapy and am following the basic documentation. I have a website from which I am trying to grab some links so that I can navigate into them. Specifically, I want to get Cokelore, College, and Computers. I am using the code below:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "snopes"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "http://www.snopes.com/info/whatsnew.asp"
    ]

    def parse(self, response):
        print response.xpath('//div[@class="navHeader"]/ul/')
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
This is the error I get:
2015-10-03 23:17:29 [scrapy] INFO: Enabled item pipelines:
2015-10-03 23:17:29 [scrapy] INFO: Spider opened
2015-10-03 23:17:29 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-03 23:17:29 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-03 23:17:30 [scrapy] DEBUG: Crawled (200) <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
2015-10-03 23:17:30 [scrapy] ERROR: Spider error processing <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/Gaby/Documents/Code/School/689/tutorial/tutorial/spiders/dmoz_spider.py", line 11, in parse
    print response.xpath('//div[@class="navHeader"]/ul/')
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/response/text.py", line 109, in xpath
    return self.selector.xpath(query)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/selector/unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //div[@class="navHeader"]/ul/
2015-10-03 23:17:30 [scrapy] INFO: Closing spider (finished)
2015-10-03 23:17:30 [scrapy] INFO: Dumping Scrapy stats:
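You just need to remove the trailing /. Replace:

//div[@class="navHeader"]/ul/

with:

//div[@class="navHeader"]/ul

Note that this XPath does not actually match anything on the page. The ul element is a sibling of the navHeader div, not a child of it. Use following-sibling instead:

In [1]: response.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a/text()').extract()
Out[1]:
[u'Autos',
 u'Business',
 u'Cokelore',
 u'College',
 # ...
 u'Weddings']

Comments:

In the code I showed, isn't the ul element a child of the element with class navHeader?

@ralphie9224 No, note the closing div. The indentation is what makes it confusing.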
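
The sibling relationship discussed above can be checked without running a crawl. The following is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than Scrapy selectors. The markup is a hypothetical, simplified stand-in for the real page, and the helper following_sibling_links is an illustrative name, not part of any library. ElementTree has no following-sibling axis, so the helper emulates it by walking the parent's children in order.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page):
# the div is closed before the ul starts, so the two are siblings.
HTML = """
<body>
  <div class="navHeader">CATEGORIES:</div>
  <ul>
    <li><a href="#">Autos</a></li>
    <li><a href="#">Cokelore</a></li>
    <li><a href="#">College</a></li>
  </ul>
</body>
"""

root = ET.fromstring(HTML)

# Child lookup: finds nothing, because the ul is not nested inside the div.
child_uls = root.findall(".//div[@class='navHeader']/ul")


def following_sibling_links(parent, css_class):
    """Return link texts from the first ul that follows the matching div.

    Emulates a following-sibling lookup by scanning the parent's direct
    children in document order.
    """
    seen_header = False
    for element in parent:
        if element.tag == "div" and element.get("class") == css_class:
            seen_header = True
        elif seen_header and element.tag == "ul":
            return [a.text for a in element.iter("a")]
    return []


links = following_sibling_links(root, "navHeader")
print(child_uls)  # empty list: no ul is a child of the div
print(links)      # link texts taken from the sibling ul
```

In a real spider the following-sibling XPath from the answer is the more direct route. This sketch only makes the structure visible: the child path comes back empty, while the sibling walk reaches the links.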