Python: scraping a forum with Scrapy does not follow the next page

python scrapy web-crawler

For clarity: I am trying to crawl a forum about casinos. So far I have successfully crawled other forums using the same scheme as the one below:

import scrapy
from urllib.parse import urlparse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class test_spider(scrapy.Spider):
    count = 0
    name = "test_spyder"
    start_urls = [
        'https://casinogrounds.com/forum/search/?&q=Casino&search_and_or=or&sortby=relevancy',
    ]

    # Note: scrapy.Spider ignores `rules`; only CrawlSpider applies them.
    rules = (Rule(LinkExtractor(restrict_css=('a:contains("Next")::attr(href)')), callback='parse'),)

    def parse(self, response):
        print(self.count)
        # Follow every thread link found on the search results page.
        for href in response.css("span.ipsType_break.ipsContained a::attr(href)"):
            new_url = response.urljoin(href.extract())
            #print(new_url)
            yield scrapy.Request(new_url, callback=self.parse_review)

        # Queue the next page of search results, if there is one.
        next_page = response.css('a:contains("Next")::attr(href)').extract_first()
        print(next_page)
        if next_page is not None:
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_review(self, response):
        parsed_uri = urlparse(response.url)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
        # Each post in a thread is rendered as an <article> element.
        for review in response.css('article.cPost.ipsBox.ipsComment.ipsComment_parent.ipsClearfix.ipsClear.ipsColumns.ipsColumns_noSpacing.ipsColumns_collapsePhone'):
            yield {
                'name': review.css('strong a.ipsType_break::text').extract_first(),
                'date': review.css('time::attr(title)').extract_first(),
                'review': review.css('p::text').extract(),
                'url': response.url
            }

        # Follow the pagination inside the thread itself.
        next_page = response.css('li.ipsPagination_next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_review)

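So when I run the spider from a Python script, it normally (I mean, for other forums) scrapes all threads on all pages, starting from the start URL.

For this site, however, it does not. It only scrapes the threads on the first page. It extracts the correct URL for the second page, but the parse function is never called again.

Of course, if I put the URLs of all pages in the start_urls list, it scrapes every page.

Thanks for your help.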

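The HTTP 429 response you are receiving means that the site is throttling your requests to avoid being flooded. You can throttle your spider to keep the request frequency within what the site allows.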
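As a rough illustration of the throttling suggested above, the settings below slow the crawl down and retry on 429. This is only a sketch: the setting names are standard Scrapy settings, but every value is an assumed starting point rather than a limit published by the site, and the name THROTTLE_SETTINGS is made up for this example.

```python
# Hypothetical settings dictionary; all numeric values are assumptions to tune.
THROTTLE_SETTINGS = {
    # Fixed minimum delay (seconds) between requests to the same site.
    "DOWNLOAD_DELAY": 2.0,
    # Send at most one request at a time to each domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Let the AutoThrottle extension adapt the delay to the server's latency.
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_START_DELAY": 5.0,
    "AUTOTHROTTLE_MAX_DELAY": 60.0,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Log the throttling statistics for every response received.
    "AUTOTHROTTLE_DEBUG": True,
    # Retry requests that were answered with 429 (Too Many Requests).
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
    "RETRY_TIMES": 5,
}

# Possible usage, inside the spider class from the question:
#     custom_settings = THROTTLE_SETTINGS
# or, when launching from a script:
#     CrawlerProcess(settings=THROTTLE_SETTINGS)
```

If the 429 responses persist, increasing DOWNLOAD_DELAY is probably the first value to try. Keeping Scrapy's log output visible should also help, since ignored non-200 responses are reported there.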

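When you say it gets the correct URL, do you mean that print(next_page) shows the correct URL?

Yes, it gives me exactly the right one.

And it does not print(self.count) for that page?

No. There is the self.count from the start URL, then print(next_page) shows the correct next URL, but then the scraping ends. I will also note that sometimes it does go to the second page and then ends there.

On the last run I saw: >>>>>> Start scraping..... Scraping ended!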