Python 3.x: Ajax-based navigation with Scrapy by generating the appropriate POST requests


python-3.x web-scraping scrapy


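I have been trying to scrape a website that uses AJAX and onclick events on link elements to control page navigation. The scraper works on the first page but never processes any page after it, so it seems the POST request I build is never fired.

I am brand new to all of this (Python, Scrapy, XPath, the DOM), but my hunch is that I have mixed structural patterns from different examples that are subtly incompatible.

Beyond my (novice) use of the scrapy shell and log messages, I would also really appreciate any pointers on how to debug this kind of problem better.

My code: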

import scrapy
from scrapy import FormRequest


class FansSpider(scrapy.Spider):
    name = "fans"
    allowed_domains = ['za.rs-online.com/web/c/hvac-fans-thermal-management/fans/axial-fans/']
    start_urls = ['http://za.rs-online.com/web/c/hvac-fans-thermal-management/fans/axial-fans/']

    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
        for component in response.xpath('//tr[@class="resultRow"]'):
            yield {
                'id': component.xpath('.//a[@class="primarySearchLink"]/text()').extract_first().strip()
            }

        next_id = response.xpath('//a[@class="rightLink nextLink approverMessageTitle"]/@id').extract_first()
        self.logger.info('Identified code of next URL as %s', next_id)

        if next_id is not None:
            first_id = response.xpath('//a[@class="rightLink nextLink approverMessageTitle"]/@onclick').\
                extract_first().split(',')[1].strip('\'')

            # POST the URL that is generated when clicking the next button
            return [FormRequest.from_response(response,
                                              url='http://za.rs-online.com/web/c/hvac-fans-thermal-management/fans/axial-fans/',
                                              formdata={'AJAXREQUEST': '_viewRoot',
                                                        first_id: first_id,
                                                        'ajax-dimensions': '',
                                                        'ajax-request': 'true',
                                                        'ajax-sort-by': '',
                                                        'ajax-sort-order': '',
                                                        'ajax-attrSort': 'false',
                                                        'javax.faces.viewState': 'j_id1',
                                                        next_id: next_id},
                                              callback=self.parse,
                                              dont_filter=True,
                                              dont_click=True,
                                              method='POST'
                                              )]
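
One way to narrow this down without hitting the site is to pull the onclick parsing and the form fields out into plain functions and look at what they produce. The sketch below is only an illustration: the helper names `second_onclick_arg` and `build_formdata`, the `doSubmit(...)` onclick strings, and the ids `j_id42` and `next_link_id` are all made up for the example, and the real attribute on the page may look different.

```python
from urllib.parse import urlencode


def second_onclick_arg(onclick):
    """Mirror the spider's parsing: take the second comma-separated
    piece of the onclick string and strip single quotes from its ends."""
    return onclick.split(',')[1].strip("'")


def build_formdata(first_id, next_id):
    """Assemble the same fields the spider passes as formdata."""
    return {
        'AJAXREQUEST': '_viewRoot',
        first_id: first_id,
        'ajax-dimensions': '',
        'ajax-request': 'true',
        'ajax-sort-by': '',
        'ajax-sort-order': '',
        'ajax-attrSort': 'false',
        'javax.faces.viewState': 'j_id1',
        next_id: next_id,
    }


# Hypothetical onclick values (assumptions, not taken from the real page).
tight = "doSubmit('_viewRoot','j_id42',event)"
spaced = "doSubmit('_viewRoot', 'j_id42', event)"

# repr() makes stray spaces and quotes visible in the output.
print(repr(second_onclick_arg(tight)))
print(repr(second_onclick_arg(spaced)))

# URL-encode the fields to see roughly what a form POST body would carry.
body = urlencode(build_formdata(second_onclick_arg(tight), 'next_link_id'))
print(body)
```

With the second sample string, `strip("'")` leaves a leading space and quote in place because it only removes quote characters from the very ends. If the value logged for the real onclick attribute still carries spaces or quotes, the field name in the POST body may not be the one the page's own script would send.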

Additional information, for reference only: I made the following changes to Scrapy's settings.py to avoid being blocked or banned by the web server:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

Thanks