Selector returns nothing in Scrapy for Python
I am using HtmlResponse and Selector. HtmlResponse returns the site, but when I inspect Selector(response) the selector shows no content, even though HtmlResponse returns this:
<200 http://www.tripadvisor.in/Hotel_Review-g3581633-d2290190-Reviews-Corbett_Tr
eetop_Riverview-Marchula_Jim_Corbett_National_Park_Uttarakhand.htmlhttp://www.tr
ipadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman
_and_Diu.html>
As noted above, you did not set the body of the Response object.
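Why not yield a new Request for each URL in your site array and let Scrapy download them? What you are doing at the moment will not work.
Of course, in that case you need to adapt your parse method, or write a new parse method and pass it to the Request as its callback (I would go with the second option).
Damn, yes, I forgot the body. Wow, that question was a waste of time.
Just yield a new Request and the callback gets called for it right away.
What did you set the body to? In your case you have no body, so you need to yield a Request, or fetch the content some other way (for example with urlopen).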
Spider code from the question:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import scrapy
from scrapy.http import HtmlResponse


class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    # base_uri = ["tripadvisor.in"]

    def start_requests(self):
        # NOTE: the first URL has no trailing comma, so Python joins it with
        # the second one into a single string (hence the fused URL in the
        # response output shown in the question).
        site_array = [
            "http://www.tripadvisor.in/Hotel_Review-g3581633-d2290190-Reviews-Corbett_Treetop_Riverview-Marchula_Jim_Corbett_National_Park_Uttarakhand.html"
            "http://www.tripadvisor.in/Hotel_Review-g297600-d8029162-Reviews-Daman_Casa_Tesoro-Daman_Daman_and_Diu.html",
            "http://www.tripadvisor.in/Hotel_Review-g304557-d2519662-Reviews-Darjeeling_Khushalaya_Sterling_Holidays_Resort-Darjeeling_West_Bengal.html",
            "http://www.tripadvisor.in/Hotel_Review-g319724-d3795261-Reviews-Dharamshala_The_Sanctuary_A_Sterling_Holidays_Resort-Dharamsala_Himachal_Pradesh.html",
            "http://www.tripadvisor.in/Hotel_Review-g1544623-d8029274-Reviews-Dindi_By_The_Godavari-Nalgonda_Andhra_Pradesh.html",
        ]
        for i in range(len(site_array)):
            # HtmlResponse(url) only stores the URL; nothing is downloaded,
            # so the body is empty and the selector has nothing to match.
            response = HtmlResponse(site_array[i])
            sels = Selector(response)
            sites = sels.xpath('//a[contains(text(), "Next")]/@href').extract()
            print "________________________________________________________________"
            print sels
            print "________________________________________________________________"
            if(sites and len(sites) > 0):
                for site in sites:
                    yield Request(site_array[i],self.parse)
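A side note on the URL list in the spider: the first entry has no comma after it, and Python silently concatenates adjacent string literals. This explains why the response output in the question shows two URLs run together. A minimal sketch of the effect, using placeholder example.com URLs that are assumptions and not part of the original code:

```python
# Adjacent string literals are joined at compile time, so a missing comma
# turns two list entries into one.
urls_missing_comma = [
    "http://example.com/a.html"   # no comma here
    "http://example.com/b.html",
    "http://example.com/c.html",
]

urls_fixed = [
    "http://example.com/a.html",
    "http://example.com/b.html",
    "http://example.com/c.html",
]

print(len(urls_missing_comma), urls_missing_comma[0])
print(len(urls_fixed), urls_fixed[0])
```

Putting each URL on its own line with a trailing comma, including the last one, makes this kind of slip easier to spot.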