Scrapy - different page content when downloading response.body
I am trying to parse pages such as www.page.com/results?sort=price. I fetch them with the following spider code:
def start_requests(self):
    start_urls = [
        "www.page.com/results?sort=price",
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    # some code
    next_page = "www.page.com/results?sort=price&type=12"
    yield response.follow(next_page, self.get_models)

def get_models(self, response):
    f = open('/tmp/test/file1.txt', 'w')
    f.write(response.url)
    f.write(response.body.decode('utf-8'))
    f.close()
In a second spider, I request the final URL directly:

def start_requests(self):
    start_urls = [
        "www.page.com/results?sort=price&type=12",
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.get_models)

def get_models(self, response):
    f = open('/tmp/test/file2.txt', 'w')
    f.write(response.url)
    f.write(response.body.decode('utf-8'))
    f.close()
The output files produced by these two spiders (file1.txt and file2.txt) are different.
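To pinpoint exactly how the two downloads differ, a unified diff of the saved bodies can help. A minimal sketch using Python's standard difflib (the stand-in bodies below are placeholders; in practice, read the saved files):

```python
import difflib

def diff_bodies(a: str, b: str, name_a: str = "file1.txt", name_b: str = "file2.txt"):
    """Return a unified diff of two downloaded page bodies, line by line."""
    return list(difflib.unified_diff(
        a.splitlines(), b.splitlines(), name_a, name_b, lineterm=""))

# Stand-in bodies; replace with the contents of the two saved files.
for line in diff_bodies("<html>old</html>", "<html>new</html>"):
    print(line)
```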
When I download the page via

scrapy shell 'www.page.com/results?sort=price&type=12'

the output is similar to file2.txt. The problem is that file1.txt does not contain the tags holding the data I need to crawl. What is the difference between these two ways of fetching the page, and why are the downloaded files different?

Comments:
I think that in the second case you are pointing at the wrong URL. Check your logs to make sure.
I don't know how to do that.
I see no reason to use it here, since you are using a full URL (not just the path).
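One concrete thing worth checking: the URLs in the question have no scheme ("www.page.com/...", not "https://www.page.com/..."). scrapy.Request requires an absolute URL with a scheme (it raises ValueError: Missing scheme otherwise), while response.follow resolves its argument relative to the response URL, the same way urllib.parse.urljoin does, so a scheme-less "full" URL is silently treated as a relative path. A small sketch (page.com is the question's placeholder domain; https is assumed):

```python
from urllib.parse import urljoin

base = "https://www.page.com/results?sort=price"
# A URL without a scheme is treated as a *relative* reference,
# so response.follow would actually request this joined URL:
joined = urljoin(base, "www.page.com/results?sort=price&type=12")
print(joined)  # https://www.page.com/www.page.com/results?sort=price&type=12
```

If the logs show a request to a doubled path like this, that alone would explain why file1.txt and file2.txt differ.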
Try changing it to a plain scrapy.Request:

def parse(self, response):
    # some code
    next_page = "www.page.com/results?sort=price&type=12"
    yield scrapy.Request(next_page, self.get_models)
It is hard to say without knowing the real URL or seeing the output logs, but the first request may set some cookies that change the behavior of the second one.
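If cookies are the suspect, one way to test this hypothesis (a diagnostic step, not a fix) is to disable Scrapy's cookie handling and re-run both spiders; if the two bodies then match, the first request was setting a cookie that changed the second response. In settings.py:

```python
# settings.py -- disable the cookie middleware for a test run,
# so the first request cannot influence the second via cookies
COOKIES_ENABLED = False
```

A single request can also opt out without changing global settings, via Request(url, meta={'dont_merge_cookies': True}).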