Python Scrapy: scraping data from paginated pages

python xpath web-scraping scrapy


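So far I have scraped data from a single page. I want to continue through the pagination until the last page.

View page

There seems to be a problem because the href of the "Next" link contains JavaScript instead of a URL: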

<a href="javascript:void(0)" class="next" data-role="next" data-spm-anchor-id="a2700.galleryofferlist.pagination.8">Next</a>

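Question

How can I solve this pagination problem?

Please help me modify my code so that I can follow the pagination links and scrape the data through to the last page.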

To find and parse all pages in a category, you can use the following approach:

import re
import requests

base_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page="

try:
    resp = requests.get(base_url)
    # Read the total number of pages from the "pagination" data in the page source
    n_pages = re.findall(r'"pagination":\{\s+"total":(.*?),', resp.text, re.IGNORECASE)
    if n_pages:
        for page in range(1, int(n_pages[0]) + 1):
            url = "{}{}".format(base_url, page)
            # Do the parsing in this block using the dynamically generated URLs:
            # https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1
            # ...
            # https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=68

except Exception as e:
    print("Cannot find/parse the total number of pages", e)
    # General except; the error handling needs improvement
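The same idea can be split into two small functions so that the page-count extraction can be checked without a network request. This is only a sketch: the function names and `sample_html` are hypothetical, and the pattern is a slightly loosened variant of the one above (`\s*` and `(\d+)`), assuming the page embeds the total as a plain integer.

```python
import re

# Variant of the pattern from the answer above; the exact layout of the
# embedded pagination data is an assumption.
PAGINATION_RE = re.compile(r'"pagination":\{\s*"total":(\d+)', re.IGNORECASE)


def extract_total_pages(html):
    """Return the total page count found in the HTML, or None if absent."""
    match = PAGINATION_RE.search(html)
    return int(match.group(1)) if match else None


def build_page_urls(base_url, total_pages):
    """Build one URL per page, numbered from 1 to total_pages inclusive."""
    return ["{}{}".format(base_url, page) for page in range(1, total_pages + 1)]


# Hypothetical snippet standing in for resp.text
sample_html = '... "pagination":{ "total":3,"current":1} ...'
base_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page="

total = extract_total_pages(sample_html)
urls = build_page_urls(base_url, total) if total else []
```

Returning `None` when the pattern is missing lets the caller decide whether to stop or fall back to another pagination strategy, rather than relying on a broad `except`.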

You could try to get the next page URL with code like this:

next_page_url = response.xpath('//div[@class="ui2-pagination-pages"]/span[@class="current"]/following-sibling::a[1][contains(@href, "?page=")]/@href').extract_first()

But this won't work, because the pagination block is rendered by JavaScript :-(

However, you can use a trick:

next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
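To see why this trick can work where the pagination block fails, it may help to check the raw HTML outside Scrapy. The sketch below uses only the standard library to look for a `<link rel="next">` tag; `NextLinkFinder`, `find_next_link`, and `sample_html` are hypothetical names, and the sample markup is an assumed shape of the page, not a copy of it.

```python
from html.parser import HTMLParser


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <link rel="next"> tag encountered."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "link" and attr_map.get("rel") == "next" and self.next_href is None:
            self.next_href = attr_map.get("href")


def find_next_link(html):
    """Return the rel="next" href from static HTML, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    return finder.next_href


# Assumed page shape: a static <link rel="next"> in the head, plus the
# JavaScript-driven "Next" anchor from the question in the body.
sample_html = (
    '<html><head>'
    '<link rel="next" href="//www.alibaba.com/catalog/agricultural-growing-media_cid144?page=2">'
    '</head><body>'
    '<a href="javascript:void(0)" class="next" data-role="next">Next</a>'
    '</body></html>'
)

next_href = find_next_link(sample_html)
```

If this returns `None` for the body that the spider actually downloaded, the tag is probably missing from the served HTML (for example because the site returned a different page to the crawler), and the XPath cannot match either.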

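When I view the page source I can see the link tag, but when I try to query it in the shell it doesn't seem to work. What could be wrong?

# follow pagination link
next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
if next_page_url:
    yield scrapy.Request(url=next_page_url, callback=self.parse)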

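This doesn't work when I use it in the spider. Is something possibly incorrect?

Good idea. I would like to do this dynamically, but my knowledge of Scrapy is weak. I can implement this solution, and if there is a more dynamic way I will work it out over time. Thanks!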
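One possible cause, offered as an assumption rather than a diagnosis, is that the extracted href is relative or protocol-relative (starting with `//`). `scrapy.Request` requires an absolute URL with a scheme and raises a `ValueError` otherwise. The sketch below shows the resolution step with the standard library; `resolve_next_url` and the example hrefs are hypothetical. Inside a spider, `response.urljoin(next_page_url)` performs the same resolution against the response URL.

```python
from urllib.parse import urljoin


def resolve_next_url(page_url, href):
    """Turn an extracted href into an absolute URL, or None if unusable."""
    # Nothing extracted, or a JavaScript pseudo-link such as "javascript:void(0)"
    if not href or href.lower().startswith("javascript:"):
        return None
    # urljoin handles absolute, relative, and protocol-relative ("//host/...") hrefs
    return urljoin(page_url, href)


page_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1"

# Hypothetical href values that the XPath might return
protocol_relative = resolve_next_url(
    page_url, "//www.alibaba.com/catalog/agricultural-growing-media_cid144?page=2"
)
query_only = resolve_next_url(page_url, "?page=2")
js_link = resolve_next_url(page_url, "javascript:void(0)")
```

With this in place, the request would be yielded only when the resolved URL is not `None`.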
