
Python: Scrapy SgmlLinkExtractor stops after 10 pages


Currently, my rule for the SgmlLinkExtractor looks like this:

    rules = (
        Rule(SgmlLinkExtractor(allow=("/boards/recentnews.aspx",),
                               restrict_xpaths=('//*[text()[contains(.,"Next")]]',)),
             callback="parse_start_url", follow=True),
    )
I want Scrapy to stop crawling after it reaches page 10, so I figured it should look something like this:

    rules = (
        Rule(SgmlLinkExtractor(allow=("/boards/recentnews.aspx?page=\d*",),
                               restrict_xpaths=('//*[text()[contains(.,"Next")]]',)),
             callback="parse_start_url", follow=True),
    )

But I can't figure out how to do it so that the rule only applies to pages 1-10.

You can do it in the callback:

import re
from scrapy.exceptions import CloseSpider

def parse_start_url(self, response):
    page_number = int(re.search(r'page=(\d+)', response.url).group(1))
    if page_number > 10:
        raise CloseSpider('page number limit exceeded')
    # scrape the data
Here is what the line with the regex does:

>>> import re
>>> url = "http://example.com/boards/recentnews.aspx?page=9"
>>> re.search('page=(\d+)', url).group(1)
'9'
>>> url = "http://example.com/boards/recentnews.aspx?page=10"
>>> re.search('page=(\d+)', url).group(1)
'10'
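One caveat: if the first page of the board is addressed without a `page` query parameter (an assumption about how the site builds its URLs), `re.search` returns `None` and `.group(1)` raises an `AttributeError`. Guarding the match avoids that:

```python
import re

# A URL without an explicit page parameter (assumption: this is how
# the first page of the board is addressed).
url = "http://example.com/boards/recentnews.aspx"
match = re.search(r'page=(\d+)', url)
page_number = int(match.group(1)) if match else 1  # fall back to page 1
print(page_number)  # 1
```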

Thanks. I was thinking there might be a way to include it in the rule itself.
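Regarding that comment: one way to push the limit into the rule itself is an anchored `allow` regex that only admits pages 1 through 10, so links past page 10 are never extracted in the first place. A minimal sketch of such a pattern, assuming the page number is the final query parameter in the URL:

```python
import re

# Hypothetical anchored pattern for the rule's allow= argument.
# It matches page=1 through page=10 and rejects anything higher
# (assumption: the page number ends the URL).
ALLOW = r"/boards/recentnews\.aspx\?page=([1-9]|10)$"

urls = [
    "http://example.com/boards/recentnews.aspx?page=1",
    "http://example.com/boards/recentnews.aspx?page=10",
    "http://example.com/boards/recentnews.aspx?page=11",
]
print([bool(re.search(ALLOW, u)) for u in urls])  # [True, True, False]
```

If that assumption about the URL shape holds, the pattern can be dropped into the existing rule's `allow=(...)` tuple; note that `.` and `?` must be escaped, since `allow` patterns are regular expressions.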