Python Scrapy not parsing items
I'm trying to scrape a paginated webpage, but the callback never parses the items. Any help would be appreciated. The code is below:
# -*- coding: utf-8 -*-
import scrapy

from ..items import EscrotsItem


class Escorts(scrapy.Spider):
    name = 'escorts'
    allowed_domains = ['www.escortsandbabes.com.au']
    start_urls = ['https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/']

    def parse_links(self, response):
        for i in response.css('.btn.btn-default.btn-block::attr(href)').extract()[2:]:
            yield scrapy.Request(url=response.urljoin(i), callback=self.parse)
        NextPage = response.css('.page.next-page::attr(href)').extract_first()
        if NextPage:
            yield scrapy.Request(
                url=response.urljoin(NextPage),
                callback=self.parse_links)

    def parse(self, response):
        for x in response.xpath('//div[@class="advertiser-profile"]'):
            item = EscrotsItem()
            item['Name'] = x.css('.advertiser-names--display-name::text').extract_first()
            item['Username'] = x.css('.advertiser-names--username::text').extract_first()
            item['Phone'] = x.css('.contact-number::text').extract_first()
            yield item
Your spider requests the start URL and sends the response to the parse method (Scrapy's default callback). Since the directory page contains no div.advertiser-profile elements, the spider closes without yielding any results, and parse_links is never called at all.

Swap the function names:
import scrapy


class Escorts(scrapy.Spider):
    name = 'escorts'
    allowed_domains = ['escortsandbabes.com.au']
    start_urls = ['https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/']

    def parse(self, response):
        for i in response.css('.btn.btn-default.btn-block::attr(href)').extract()[2:]:
            yield scrapy.Request(response.urljoin(i), self.parse_links)
        next_page = response.css('.page.next-page::attr(href)').get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

    def parse_links(self, response):
        for x in response.xpath('//div[@class="advertiser-profile"]'):
            item = {}
            item['Name'] = x.css('.advertiser-names--display-name::text').get()
            item['Username'] = x.css('.advertiser-names--username::text').get()
            item['Phone'] = x.css('.contact-number::text').get()
            yield item
My scrapy shell log:
In [1]: fetch("https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/")
2019-03-29 15:22:56 [scrapy.core.engine] INFO: Spider opened
2019-03-29 15:23:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/> (referer: None, latency: 2.48 s)
In [2]: response.css('.page.next-page::attr(href)').get()
Out[2]: u'/Directory/ACT/Canberra/2600/Any/All/?p=2'
I've added the log from my shell session; you should be able to see it. I also assume you need to fix allowed_domains = ['escortsandbabes.com.au'], dropping the www, as the log written after a run suggests. — I can fetch the data in the shell, but when I run scrapy crawl escorts it doesn't scrape any items. — Did you fix the allowed_domains variable? I've added the full spider code; it works now. — I removed allowed_domains and now it works. Thanks for your help!
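The www mismatch matters because Scrapy's offsite filtering compares each request's host against allowed_domains: a host passes if it equals an allowed domain or is a subdomain of it. The sketch below is a hypothetical stdlib-only illustration of that matching rule (not Scrapy's actual source); with 'www.escortsandbabes.com.au' in allowed_domains, follow-up requests to the bare domain get dropped as offsite, which is why the shell worked but the crawl yielded nothing.

```python
# Illustration of allowed_domains-style host matching (hypothetical
# helper, not the real OffsiteMiddleware code): a request is offsite
# unless its host equals an allowed domain or is a subdomain of one.
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    host = urlparse(url).netloc.lower()
    return not any(
        host == d.lower() or host.endswith("." + d.lower())
        for d in allowed_domains
    )


# With 'www.' in allowed_domains, bare-domain URLs are filtered out:
print(is_offsite("https://escortsandbabes.com.au/Directory/",
                 ["www.escortsandbabes.com.au"]))  # True  -> dropped
# Without 'www.', both the bare domain and its subdomains pass:
print(is_offsite("https://escortsandbabes.com.au/Directory/",
                 ["escortsandbabes.com.au"]))      # False -> crawled
print(is_offsite("https://www.escortsandbabes.com.au/Directory/",
                 ["escortsandbabes.com.au"]))      # False -> crawled
```

Removing allowed_domains entirely, as done above, also works, but then the spider will follow any external links it yields, so fixing the domain is the safer option.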