Python: scraping an ASP.NET page with multiple sub-pages / no sub-pages: yield inside an if-else statement
Here is the spider.py file:
import scrapy
from scrapy_spider.items import JobsItem

class JobSpider(scrapy.Spider):
    name = 'burzarada'
    start_urls = ['https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx']
    download_delay = 1.5

    def parse(self, response):
        for href in response.css('div.NKZbox > div.KategorijeBox > a ::attr(href)').extract():
            eventTarget = href.replace("javascript:__doPostBack('", "").replace("','')", "")
            eventArgument = response.css('#__EVENTARGUMENT::attr(value)').extract()
            lastFocus = response.css('#__LASTFOCUS::attr(value)').extract()
            viewState = response.css('#__VIEWSTATE::attr(value)').extract()
            viewStateGenerator = response.css('#__VIEWSTATEGENERATOR::attr(value)').extract()
            viewStateEncrypted = response.css('#__VIEWSTATEENCRYPTED::attr(value)').extract()
            yield scrapy.FormRequest(
                'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx',
                formdata={
                    '__EVENTTARGET': eventTarget,
                    '__EVENTARGUMENT': eventArgument,
                    '__LASTFOCUS': lastFocus,
                    '__VIEWSTATE': viewState,
                    '__VIEWSTATEGENERATOR': viewStateGenerator,
                    '__VIEWSTATEENCRYPTED': viewStateEncrypted,
                },
                callback=self.parse_category
            )

    def parse_category(self, response):
        href = response.xpath('//select[@id="ctl00_MainContent_ddlPageSize"]').extract()
        eventTarget = "ctl00$MainContent$ddlPageSize"
        eventArgument = response.css('#__EVENTARGUMENT::attr(value)').extract()
        lastFocus = response.css('#__LASTFOCUS::attr(value)').extract()
        viewState = response.css('#__VIEWSTATE::attr(value)').extract()
        viewStateGenerator = response.css('#__VIEWSTATEGENERATOR::attr(value)').extract()
        viewStateEncrypted = response.css('#__VIEWSTATEENCRYPTED::attr(value)').extract()
        pageSize = '75'
        sort = '0'
        yield scrapy.FormRequest(
            'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx',
            formdata={
                '__EVENTTARGET': eventTarget,
                '__EVENTARGUMENT': eventArgument,
                '__LASTFOCUS': lastFocus,
                '__VIEWSTATE': viewState,
                '__VIEWSTATEGENERATOR': viewStateGenerator,
                '__VIEWSTATEENCRYPTED': viewStateEncrypted,
                'ctl00$MainContent$ddlPageSize': pageSize,
                'ctl00$MainContent$ddlSort': sort,
            },
            callback=self.parse_multiple_pages
        )

    def parse_multiple_pages(self, response):
        hrefs = response.xpath('//*[@id="ctl00_MainContent_gwSearch"]//tr[last()]//li/a/@href').extract()
        ##################################
        # Here is the part with the problem
        if len(hrefs) != 0:  # yield statement
            for href in hrefs:
                eventTarget = href.replace("javascript:__doPostBack('", "").replace("','')", "")
                eventArgument = response.css('#__EVENTARGUMENT::attr(value)').extract()
                lastFocus = response.css('#__LASTFOCUS::attr(value)').extract()
                viewState = response.css('#__VIEWSTATE::attr(value)').extract()
                viewStateGenerator = response.css('#__VIEWSTATEGENERATOR::attr(value)').extract()
                viewStateEncrypted = response.css('#__VIEWSTATEENCRYPTED::attr(value)').extract()
                pageSize = '75'
                sort = '0'
                print(eventTarget)
                yield scrapy.FormRequest(
                    'https://burzarada.hzz.hr/Posloprimac_RadnaMjesta.aspx',
                    formdata={
                        '__EVENTTARGET': eventTarget,
                        '__EVENTARGUMENT': eventArgument,
                        '__LASTFOCUS': lastFocus,
                        '__VIEWSTATE': viewState,
                        '__VIEWSTATEGENERATOR': viewStateGenerator,
                        '__VIEWSTATEENCRYPTED': viewStateEncrypted,
                        'ctl00$MainContent$ddlPageSize': pageSize,
                        'ctl00$MainContent$ddlSort': sort,
                    },
                    callback=self.parse_links
                )
        else:  # another yield
            for link in links:
                link = 'https://burzarada.hzz.hr/' + link
                yield scrapy.Request(url=link, callback=self.parse_job)
        ##########################################

    def parse_links(self, response):
        links = response.xpath('//a[@class="TitleLink"]/@href').extract()
        for link in links:
            link = 'https://burzarada.hzz.hr/' + link
            yield scrapy.Request(url=link, callback=self.parse_job)

    def parse_job(self, response):
        item = JobsItem()
        item['url'] = ''
        item['title'] = ''
        item['workplace'] = ''
        item['required_workers'] = ''
        item['type_of_employment'] = ''
        item['working_hours'] = ''
        item['mode_of_operation'] = ''
        item['accomodation'] = ''
        item['transportation_fee'] = ''
        item['start_date'] = ''
        item['end_date'] = ''
        item['education_level'] = ''
        item['work_experience'] = ''
        item['other_information'] = ''
        item['employer'] = ''
        item['contact'] = ''
        item['driving_test'] = ''
        yield item
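As a side note, every callback above re-reads the same five ASP.NET hidden form fields. A small helper (hypothetical, not part of the original spider) could centralize that lookup; here it takes a value-lookup callable so the logic stays independent of Scrapy:

```python
# Hypothetical helper: build the hidden-field portion of an ASP.NET
# postback formdata. `get_value` maps a field id to its value, e.g.
#   lambda f: response.css('#%s::attr(value)' % f).extract_first('')
ASP_HIDDEN_FIELDS = (
    '__EVENTARGUMENT', '__LASTFOCUS', '__VIEWSTATE',
    '__VIEWSTATEGENERATOR', '__VIEWSTATEENCRYPTED',
)

def asp_state(get_value):
    return {field: get_value(field) for field in ASP_HIDDEN_FIELDS}
```

Each callback could then build its formdata as `dict(asp_state(...), __EVENTTARGET=eventTarget)` instead of repeating the six selector lines.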
As you can see, the page structure is not very complicated. That is the link to the page I want to scrape.
There are 16 hyperlinks on the page, and each of them issues a POST request that returns a different number of jobs in the list; the first link has 1,000 of them. The job list's page size is initially set to 25, so the first link has no sub-pages while the second link has more than 10 sub-pages.
I managed to change the page size to 75 so that I wouldn't have to deal with so many sub-pages; the problem then moves to the next part.
The problem is that I can't get any items from the first link (the one without sub-pages); scraping only starts from the second link (the one with 10+ sub-pages). I added several print() calls to follow the flow (removed here for brevity) and found that the else: branch is never reached.
If I restrict the spider to the first link only (limiting the for loop in parse() to a single iteration), it works fine.
I've been struggling with this for hours and can't find any useful answer.
My guess is that it's because the first link has no sub-pages; if it had some, I wouldn't have needed to add the if-else at all.
Can anyone help me?

I have run that code and it seems to work correctly overall; scraping just starts from the second link. Actually, it does try to process the first category. The problem is that links is never defined, so the spider fails with the exception NameError: name 'links' is not defined.
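That error points directly at the fix: the else branch uses links without ever assigning it. A minimal sketch of a correction, assuming the no-pagination case should collect the same TitleLink hrefs that parse_links collects; the pure URL-building step is factored out so it can be checked on its own:

```python
# Hypothetical fix for the else branch of parse_multiple_pages.
def absolutize(relative_links, base='https://burzarada.hzz.hr/'):
    """Prefix each relative job-page href with the site root."""
    return [base + link for link in relative_links]

# In the spider, the else branch would then read:
#     else:
#         links = response.xpath('//a[@class="TitleLink"]/@href').extract()
#         for link in absolutize(links):
#             yield scrapy.Request(url=link, callback=self.parse_job)
```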
Scrapy failing to parse one page does not stop the whole crawler, so Scrapy carries on with the pages that do have pagination.
You could also include the pagination and sorting in the spider's very first request; in that case you can simplify the spider by removing parse_category.
Also, this selector

hrefs = response.xpath('//*[@id="ctl00_MainContent_gwSearch"]//tr[last()]//li/a/@href').extract()

can be much simpler:
hrefs = response.xpath('//ul[contains(@class, "pagination")]//a/@href').extract()
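Each href matched by either selector is a javascript:__doPostBack(...) call, and the spider recovers the __EVENTTARGET value from it with two replace() calls. That step is pure string handling and can be verified in isolation (same logic as in the spider above):

```python
def event_target(href):
    """Extract the ASP.NET __EVENTTARGET from a __doPostBack href."""
    return href.replace("javascript:__doPostBack('", "").replace("','')", "")
```

Note that an href which is not a __doPostBack call passes through unchanged, so a plain URL would silently end up in __EVENTTARGET.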
Taking everything mentioned above into account could make it a bit simpler.

Thank you, but I'm not getting the result I want. It still refuses to read the other pages; it just goes round and round on a single page. Let me show you the code I'm debugging. @Nikita

If you add print(f"title - {response.xpath('//span[@id=\"ctl00_MainContent_lblResults\"]//text()').extract()}") at the beginning of parse_multiple_pages, you will see that you keep reading the same page. You definitely have a mistake somewhere in parse_category.

The code in the pastebin link in my post was simplified, so it doesn't include parse_category or that mistake. Here is slightly updated code with a few comments.

It works! Thank you very much, you saved the day!!!