Crawling table data without a 'Next' button with Scrapy

python web-scraping scrapy

I'm new to Scrapy, and I'm trying to get the table data from every page of this site.

This is my code:

import scrapy

class UAESpider(scrapy.Spider):
    name = 'uae_free'

    allowed_domains = ['https://www.uaeonlinedirectory.com']

    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]
    
    def parse(self, response):
        pages = response.xpath('//table[@class="GridViewStyle"]//tr[12]')

        for page in pages[1:11]:
            rows = page.xpath('//table[@class="GridViewStyle"]//tr')
            for row in rows[1:11]:
                yield {
                    'company_name': row.xpath('.//td[2]//text()').get(),
                    'company_name_link': row.xpath('.//td[2]//a/@href').get(),
                    'zone': row.xpath('.//td[4]//text()').get(),
                    'category': row.xpath('.//td[6]//text()').get(),
                    'category_link': row.xpath('.//td[6]//a/@href').get()
                }

        next_page = response.xpath('//table[@class="GridViewStyle"]//tr[12]//td[11]//a/@href').get()

        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse)

But it doesn't work, and I get the error below. The URL in the message is the link to page 11:

ValueError: Missing scheme in request url: javascript:__doPostBack('ctl00$ContentPlaceHolder2$grdDirectory','Page$11')

Do you know how to fix this error?

Update:

Following the instructions suggested by @zmike, this is what I have so far:

import scrapy
from scrapy.http import FormRequest

URL = 'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'

class UAESpider(scrapy.Spider):
    name = 'uae_free'

    allowed_domains = ['https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A']

    start_urls = [
        'https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A'
    ]

    def parse(self, response):
        self.data = {}

        for form_input in response.css('form#aspnetForm input'):
            name = form_input.xpath('@name').extract()[0]
            try:
                value = form_input.xpath('@value').extract()[0]
            except IndexError:
                value = ""
            self.data[name] = value

        self.data['ctl00_ContentPlaceHolder2_panelGrid'] = 'ctl00$ContentPlaceHolder2$grdDirectory'
        self.data['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder2$grdDirectory'
        self.data['__EVENTARGUMENT'] = 'Page$1'

        return FormRequest(url=URL,
                            method='POST',
                            callback=self.parse_page,
                            formdata=self.data,
                            meta={'page':1},
                            dont_filter=True)

    def parse_page(self, response):
        current_page = response.meta['page'] + 1
        rows = response.xpath('//table[@class="GridViewStyle"]//tr')
        for row in rows[1:11]:
            yield {
                'company_name': row.xpath('.//td[2]//text()').get(),
                'company_name_link': row.xpath('.//td[2]//a/@href').get(),
                'zone': row.xpath('.//td[4]//text()').get(),
                'category': row.xpath('.//td[6]//text()').get(),
                'category_link': row.xpath('.//td[6]//a/@href').get()
            }

        return FormRequest(url=URL,
                            method='POST',
                            formdata={
                                '__EVENTARGUMENT': 'Page$%d' % current_page,
                                '__EVENTTARGET': 'ctl00$ContentPlaceHolder2$grdDirectory',
                                'ctl00_ContentPlaceHolder2_panelGrid':'ctl00$ContentPlaceHolder2$grdDirectory',
                                '':''},
                            meta={'page': current_page},
                           dont_filter=True)

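This code only gets the table data from the first page; it doesn't move on to the remaining pages. Do you know where I'm going wrong?

Here is a working (albeit crude) implementation of the crawler that goes through all the pages. Some notes: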

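- The form data needs different parameters, e.g. `__EVENTTARGET`, `__EVENTVALIDATION`, `__VIEWSTATEGENERATOR`, etc. I used XPath to get them instead of regex.
- The following is not necessary: `self.data['ctl00_ContentPlaceHolder2_panelGrid'] = 'ctl00$ContentPlaceHolder2$grdDirectory'`
- I combined the functions for simplicity. Using `parse` as its own callback lets it loop through all the pages.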

import scrapy
from scrapy.http import FormRequest


class UAESpider(scrapy.Spider):
    name = 'uae_free'

    headers = {
        'X-MicrosoftAjax': 'Delta=true',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
    }

    allowed_domains = ['www.uaeonlinedirectory.com']

    # TODO: include the URLs for all other items (e.g. A-Z)
    start_urls = ['https://www.uaeonlinedirectory.com/UFZOnlineDirectory.aspx?item=A']

    current_page = 0

    def parse(self, response):
        # Request the next page
        self.current_page = self.current_page + 1

        if self.current_page == 1:
            # Submit the form (first page)
            data = {}
            for form_input in response.css('form#aspnetForm input'):
                name = form_input.xpath('@name').extract()[0]
                try:
                    value = form_input.xpath('@value').extract()[0]
                except IndexError:
                    value = ""
                data[name] = value
            data['__EVENTTARGET'] = 'ctl00$MainContent$List'
            data['__EVENTARGUMENT'] = 'Page$1'
        else:
            # Extract the form fields and arguments using XPath
            event_validation = response.xpath('//input[@id="__EVENTVALIDATION"]/@value').extract()
            view_state = response.xpath('//input[@id="__VIEWSTATE"]/@value').extract()
            view_state_generator = response.xpath('//input[@id="__VIEWSTATEGENERATOR"]/@value').extract()
            view_state_encrypted = response.xpath('//input[@id="__VIEWSTATEENCRYPTED"]/@value').extract()

            data = {
                '__EVENTTARGET': 'ctl00$ContentPlaceHolder2$grdDirectory',
                '__EVENTARGUMENT': 'Page$%d' % self.current_page,
                '__EVENTVALIDATION': event_validation,
                '__VIEWSTATE': view_state,
                '__VIEWSTATEGENERATOR': view_state_generator,
                '__VIEWSTATEENCRYPTED': view_state_encrypted,
                '__ASYNCPOST': 'true',
                '': ''
            }

        # Yield the companies
        # TODO: move this to a different function
        rows = response.xpath('//table[@class="GridViewStyle"]//tr')
        for row in rows[1:11]:
            result = {
                'company_name': row.xpath('.//td[2]//text()').get(),
                'company_name_link': row.xpath('.//td[2]//a/@href').get(),
                'zone': row.xpath('.//td[4]//text()').get(),
                'category': row.xpath('.//td[6]//text()').get(),
                'category_link': row.xpath('.//td[6]//a/@href').get()
            }
            print(result)
            yield result

        # TODO: check if there is a next page, and only yield if there is one
        yield FormRequest(url=self.start_urls[0],  # TODO: change this so the index is not hardcoded
                          method='POST',
                          formdata=data,
                          callback=self.parse,
                          meta={'page': self.current_page},
                          dont_filter=True,
                          headers=self.headers)
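
The "check if there is a next page" TODO could be handled by looking at the pager links themselves. Each one is a `javascript:__doPostBack('target','argument')` href, which is not a real URL (hence the `Missing scheme` error in the question); its two arguments are the values that go into `__EVENTTARGET` and `__EVENTARGUMENT`. The following is only a minimal sketch using the standard library: the helper names `parse_postback` and `has_page` are hypothetical, and it assumes the pager renders a link for every reachable page other than the current one, which I have not verified against the live site.

```python
import re

# Matches hrefs such as:
#   javascript:__doPostBack('ctl00$ContentPlaceHolder2$grdDirectory','Page$11')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) for a __doPostBack href, else None."""
    match = POSTBACK_RE.search(href or "")
    return match.groups() if match else None


def has_page(hrefs, page):
    """Return True if any pager link in `hrefs` posts back with 'Page$<page>'."""
    wanted = "Page$%d" % page
    for href in hrefs:
        parsed = parse_postback(href)
        if parsed and parsed[1] == wanted:
            return True
    return False
```

In the spider, `hrefs` might come from something like `response.xpath('//table[@class="GridViewStyle"]//tr[12]//a/@href').getall()` (the pager row selector from the question), and the final `FormRequest` would only be yielded when `has_page` reports a link for the page about to be requested.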

Maybe this will help:

@zmike I followed your link; the update section shows what I have so far. But it doesn't work. Do you know where I went wrong?