Python Scrapy: how to use a scraped item as a variable in a dynamic URL
I want to start scraping from the last page number and work from the highest page down to the lowest. Page 2267 is dynamic, so I first need to scrape that value to determine the last page number; the paginated URLs should then look like page-2267, page-2266, and so on. This is what I did:
class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']

    def parse(self, response):
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').extract_first()
        for item in response.css("[id^='fc-post-']"):
            last_page = response.xpath('//div[@class = "PageNav"]/@data-last').extract_first()
            datime = item.css("a.datePermalink span::attr(title)").get()
            message = item.css('div.messageContent blockquote').extract()
            datime = parser.parse(datime)
            yield {"last_page": last_page, "message": message, "datatime": datime}

        next_page = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-' + str(TeslamotorsclubSpider.last_page)
        print(next_page)
        TeslamotorsclubSpider.last_page = int(TeslamotorsclubSpider.last_page)
        TeslamotorsclubSpider.last_page -= 1
        yield response.follow(next_page, callback=self.parse)
I need to scrape the items from the highest page down to the lowest. Please help me, thank you.

I solved it with the following algorithm, starting from the first page:
url = url_page1
xpath_next_page = "//div[@class='pageNavLinkGroup']//a[@class='text' and contains(text(), 'Next')]"
Load the first page, do your work, and at the end check whether that XPath still matches in the HTML; if it does, increment the page number. In fact your page has a very convenient element, link[rel=next], so you can restructure the code this way: parse a page, request the next one, parse that page, request the next one, and so on:
def parse(self, response):
    for item in response.css("[id^='fc-post-']"):
        datime = item.css("a.datePermalink span::attr(title)").get()
        message = item.css('div.messageContent blockquote').extract()
        datime = parser.parse(datime)
        yield {"message": message, "datatime": datime}

    next_page = response.css('link[rel=next]::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
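Outside Scrapy, the effect of response.css('link[rel=next]::attr(href)').get() can be illustrated with the standard library's html.parser. This is only a sketch of what that selector extracts (NextLinkFinder is a hypothetical name, not part of Scrapy):

```python
from html.parser import HTMLParser

class NextLinkFinder(HTMLParser):
    """Records the href of the first <link rel="next"> tag seen."""
    def __init__(self):
        super().__init__()
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "next" and self.next_url is None:
            self.next_url = a.get("href")

finder = NextLinkFinder()
finder.feed('<html><head><link rel="next" href="/page-2"></head></html>')
# finder.next_url is now "/page-2"; on the last page no tag matches
# and it stays None, which is exactly the spider's stop condition.
```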
UPD: here is code that scrapes the data from the last page back to the first:
class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']
    next_page = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}'

    def parse(self, response):
        last_page = response.xpath('//div[@class = "PageNav"]/@data-last').get()
        if last_page and int(last_page):
            # iterate from the last page down to the first
            for i in range(int(last_page), 0, -1):
                url = self.next_page.format(i)
                yield scrapy.Request(url, self.parse_page)

    def parse_page(self, response):
        # parse the data on each page
        for item in response.css("[id^='fc-post-']"):
            last_page = response.xpath('//div[@class = "PageNav"]/@data-last').get()
            datime = item.css("a.datePermalink span::attr(title)").get()
            message = item.css('div.messageContent blockquote').extract()
            datime = parser.parse(datime)
            yield {"last_page": last_page, "message": message, "datatime": datime}
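The key step above is the descending range(). A standalone sketch of the URL sequence that parse() would request, using a small stand-in for the scraped @data-last value instead of the real 2267:

```python
# Template taken from the spider above; "3" stands in for the
# dynamically scraped @data-last value (e.g. "2267").
page_template = "https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}"
last_page = "3"

# range(start, stop, step) with step -1 counts down and stops
# before 0, so the pages come out highest-first: 3, 2, 1.
urls = [page_template.format(i) for i in range(int(last_page), 0, -1)]
```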
If you want to go from the last page to the first, try the following:
class TeslamotorsclubSpider(scrapy.Spider):
    name = 'teslamotorsclub'
    start_urls = ['https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/']
    page_start = 'https://teslamotorsclub.com/tmc/threads/tesla-tsla-the-investment-world-the-2019-investors-roundtable.139047/page-{}'
    cbool = False

    def parse(self, response):
        if not self.cbool:
            last_page = response.xpath('//div[@class = "PageNav"]/@data-last').extract_first()
            self.cbool = True
            yield response.follow(self.page_start.format(int(last_page)), callback=self.parse)
        else:
            for item in response.css("[id^='fc-post-']"):
                message = item.css('div.messageContent blockquote::text').extract()
                yield {"message": message}

            prev_page = response.css("[class='PageNav'] a:contains('Prev')::attr('href')").get()
            yield {"prev_page": prev_page}  # check whether this is working
            if prev_page:
                yield response.follow(prev_page, callback=self.parse)
Comments:

2268 is dynamic; I can't hard-code 2268. — Please check the update, it should work now @Christian Read. — Hi @SIM, this one works great, thank you for helping me. — I noticed you didn't mark this answer as accepted; interesting, did you run into other problems? — Yes, sorry; I still need to start from the end of the thread, because it has more than 2000 pages, and I have an if statement that filters for posts from the latest 24 hours. If I start from the first page (the oldest posts), it takes many hours to finish. Any ideas? — Hi @vezunchik, there is a small problem with the logic of your code and I found a few issues, but I gave you a +1, thanks. — Hi @Wonka, I still need to start scraping at the last page; I can't start from page 1 because that wastes time, it has more than 2000 pages.
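The comments mention an if statement that filters for posts from the latest 24 hours, which is why scraping must start from the newest pages. The original filter isn't shown; a minimal, hypothetical helper (is_recent is not from the original code) that could be applied to the parsed datime value might look like this:

```python
from datetime import datetime, timedelta

def is_recent(post_time, now=None, window_hours=24):
    """Return True if post_time falls within the last window_hours."""
    now = now or datetime.now()
    return now - post_time <= timedelta(hours=window_hours)

# Fixed reference time so the example is deterministic.
now = datetime(2019, 12, 20, 12, 0)
is_recent(datetime(2019, 12, 20, 0, 0), now=now)   # 12 hours old -> True
is_recent(datetime(2019, 12, 18, 12, 0), now=now)  # 48 hours old -> False
```

Since the pages are requested newest-first, the spider could stop following links as soon as this returns False for every post on a page, instead of crawling all 2000+ pages.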