Web scraping: how to simulate a next-page link request with Scrapy FormRequest on a paginated .asp site

I'm having trouble scraping this page: my scraper picks up all the links to the sub-pages and scrapes them correctly (25 results), but it does not correctly submit the form request to fetch the next 25 results (and so on). I would appreciate any help anyone can offer. Thanks.
import scrapy


class ParcelScraperSpider(scrapy.Spider):
    name = 'parcel_scraper'
    start_urls = ['http://maps.kalkaskacounty.net/propertysearch.asp?PDBsearch=setdo',
                  'http://maps.kalkaskacounty.net/']

    def parse(self, response):
        for href in response.css('a.PDBlistlink::attr(href)'):
            yield response.follow(href, self.parse_details)

    def next_group(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'DBVpage': 'next'},
            formname='PDBquery',  # from_response expects a string here, not a set
            callback=self.parse,
        )
    def parse_details(self, response):
        yield {
            'owner_name': response.xpath('//td[contains(text(),"Owner Name")]/following::td[1]/text()').extract_first(),
            'jurisdiction': response.xpath('//td[contains(text(),"Jurisdiction")]/following::td[1]/text()').extract_first(),
            'property_street': response.xpath('//td[contains(text(),"Property Address")]/following::td[1]/div[1]/text()').extract_first(),
            'property_csz': response.xpath('//td[contains(text(),"Property Address")]/following::td[1]/div[2]/text()').extract_first(),
            'owner_street': response.xpath('//td[contains(text(),"Owner Address")]/following::td[1]/div[1]/text()').extract_first(),
            'owner_csz': response.xpath('//td[contains(text(),"Owner Address")]/following::td[1]/div[2]/text()').extract_first(),
            'current_tax_value': response.xpath('//td[contains(text(),"Current Taxable Value")]/following::td[1]/text()').extract_first(),
            'school_district': response.xpath('//td[contains(text(),"School District")]/following::td[1]/text()').extract_first(),
            'current_assess': response.xpath('//td[contains(text(),"Current Assessment")]/following::td[1]/text()').extract_first(),
            'current_sev': response.xpath('//td[contains(text(),"Current S.E.V.")]/following::td[1]/text()').extract_first(),
            'current_pre': response.xpath('//td[contains(text(),"Current P.R.E.")]/following::td[1]/text()').extract_first(),
            'prop_class': response.xpath('//td[contains(text(),"Current Property Class")]/following::td[1]/text()').extract_first(),
            'tax_desc': response.xpath('//h3[contains(text(),"Tax Description")]/following::div/text()').extract_first(),
        }
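For background on what `FormRequest.from_response` does here: it reads the named form out of the page, merges the form's existing fields with the `formdata` overrides, and POSTs the result. A minimal stdlib sketch of that merge step (the form markup below is hypothetical, standing in for the `.asp` page's `PDBquery` form):

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> tags, much as from_response does."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            if "name" in a:
                self.fields[a["name"]] = a.get("value", "")


# Hypothetical markup standing in for the site's PDBquery form.
form_html = """
<form name="PDBquery" action="propertysearch.asp" method="post">
  <input type="hidden" name="PDBsearch" value="setdo">
  <input type="hidden" name="DBVpage" value="">
</form>
"""

collector = FormFieldCollector()
collector.feed(form_html)

# Merging the overrides is exactly what formdata={'DBVpage': 'next'} achieves.
payload = {**collector.fields, "DBVpage": "next"}
print(payload)  # {'PDBsearch': 'setdo', 'DBVpage': 'next'}
```

This is only an illustration of the mechanics; in the spider itself, Scrapy performs this merge internally when the request is yielded.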
Looking through your code, you never call the `next_group` method: you call `parse` and `parse_details`, but nothing ever invokes `next_group`. Using the request's `meta` dict can help you get where you want. ***Just an example; not a restoration of your code:
# -*- coding: utf-8 -*-
import scrapy


class YourSpiderClassHere(scrapy.Spider):
    name = "..."
    allowed_domains = ["SomeSite.com"]
    start_urls = ['https://somesite.com/myScrappingSite']

    def parse(self, response):
        listings = response.xpath('//li[@class="result-row"]')
        for listing in listings:
            date = listing.xpath('.//*[@class="result-date"]/@datetime').extract_first()
            link = listing.xpath('.//a[@class="result-title hdrlnk"]/@href').extract_first()
            text = listing.xpath('.//a[@class="result-title hdrlnk"]/text()').extract_first()
            yield scrapy.Request(link,
                                 callback=self.parse_listing,
                                 meta={'date': date,
                                       'link': link,
                                       'text': text})

        next_page_url = response.xpath('//a[text()="next > "]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_listing(self, response):
        date = response.meta['date']
        link = response.meta['link']
        text = response.meta['text']
        compensation = response.xpath('//*[@class="attrgroup"]/span[1]/b/text()').extract_first()
        type = response.xpath('//*[@class="attrgroup"]/span[2]/b/text()').extract_first()
        address = response.xpath('//*[@id="postingbody"]/text()').extract()
        yield {'date': date,
               'link': link,
               'text': text,
               'compensation': compensation,
               'type': type,
               'address': address}
Thank you, Neal. I'll try updating my spider with your suggestion tonight. I added this to my spider:
next_page = response.css('div.ccrow div.cc2:nth-child(3) a.DBVpagelink::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
But it doesn't return any additional results. `response.css('div.ccrow div.cc2:nth-child(3) a.DBVpagelink::attr(href)').extract_first()` returns `javascript:document.PDBquery.DBVpage.value='next';document.PDBquery.submit();` as the next-page link, which is why I was asking about FormRequest.
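Since that "next" link is a `javascript:` form submit rather than a real href, a CSS-selected URL can't be followed directly; the form name, field, and value have to be recovered from the link and replayed as a form POST. A small stdlib sketch (the regex is illustrative, matched against the exact string quoted above):

```python
import re

js_href = ("javascript:document.PDBquery.DBVpage.value='next';"
           "document.PDBquery.submit();")

# Pattern for: document.<form>.<field>.value='<value>'; document.<form>.submit();
match = re.search(r"document\.(\w+)\.(\w+)\.value='([^']*)'", js_href)
form_name, field, value = match.groups()

print(form_name, field, value)  # PDBquery DBVpage next
```

The extracted pieces would then feed a `FormRequest` in the spider, e.g. `scrapy.FormRequest.from_response(response, formname=form_name, formdata={field: value}, callback=self.parse)`.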