
Python Scrapy LinkExtractor - which regex to follow?

Tags: python, regex, web-scraping, scrapy

I am trying to crawl a category on Amazon, but the links I get in Scrapy are different from the ones in the browser. Right now I am trying to follow the next-page links, and in response.body (printed to a txt file) I see links like these:

<span class="pagnMore">...</span>
<span class="pagnLink"><a href="/s?ie=UTF8&page=4&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011" >4</a></span>
<span class="pagnCur">5</span>
<span class="pagnLink"><a href="/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011" >6</a></span>
<span class="pagnMore">...</span>
<span class="pagnDisabled">20</span>
<span class="pagnRA"> <a title="Next Page"
                   id="pagnNextLink"
                   class="pagnNext"
                   href="/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011">
<span id="pagnNextString">Next Page</span>
If I get rid of the rule, or do something like "^http.*", it works, but then it follows everything. What am I doing wrong here?
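Presumably the mismatch comes from percent-encoding: the hrefs in response.body are relative and URL-encoded, so a pattern copied from the browser address bar with a literal ":" or "," will not match "%3A"/"%2C" once the link is made absolute. A minimal sketch (the URL is shortened from the snippet above):

import re

# href as it appears in response.body (shortened), after being resolved to an absolute URL
href = "https://www.amazon.com/s?ie=UTF8&page=6&rh=n%3A2619533011%2Ck%3Apet%20supplies"

print(re.search(r"rh=n:2619533011", href))    # None: ':' appears as %3A in the URL
print(re.search(r"rh=n%3A2619533011", href))  # matches the encoded form
print(re.search(r"page=\d+", href))           # matches: digits are not encoded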

Try checking only the "page" parameter:

Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_items", follow=True),
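For reference, a minimal CrawlSpider sketch around this rule (the spider name, start URL, and yielded fields are assumptions; recent Scrapy versions replace the deprecated SgmlLinkExtractor with scrapy.linkextractors.LinkExtractor):

import scrapy
from scrapy.linkextractors import LinkExtractor  # modern replacement for SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PetSuppliesSpider(CrawlSpider):
    # Hypothetical spider; name and start URL are assumptions
    name = "pet_supplies"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "https://www.amazon.com/s?ie=UTF8&page=1"
        "&rh=n%3A2619533011%2Ck%3Apet%20supplies",
    ]

    rules = (
        # Follow only links whose absolute URL contains "page=<digits>"
        Rule(LinkExtractor(allow=r"page=\d+"), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        # Placeholder extraction; the real fields depend on the project
        yield {"url": response.url}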

Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_items", follow=True), works but crawls everything. Rule(SgmlLinkExtractor(allow=r"page=\d+", restrict_xpaths='//*[@id="pagnNextLink"]'), callback="parse_item", follow=True), works but crawls 0 pages and stops immediately. I want to restrict the crawl somehow. @Chris I would find the pagination container/block and use it for the restriction.
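A hedged sketch of that suggestion, restricting extraction to the pagination spans visible in the HTML above (the id of the enclosing pagination container is not shown, so the XPath targets the pagnLink/pagnRA spans directly):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Extract links only from the pagination spans shown in the snippet
    Rule(
        LinkExtractor(
            allow=r"page=\d+",
            restrict_xpaths='//span[@class="pagnLink"] | //span[@class="pagnRA"]',
        ),
        callback="parse_items",
        follow=True,
    ),
)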