Scrapy: how to fix CrawlSpider rules when only one rule is followed

scrapy web-crawler

This code does not work:

name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]

rules = (
    #categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'),unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div/a'),unique=True),callback='parse_item'),
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'),unique=True)),
)

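The first rule gets responses, but the second rule does not work. I am sure the second rule's XPath is correct (I tried it in scrapy shell). I also tried adding a callback to the first rule, selecting the second rule's path ('//div[@id="ItemResultList"]/div/div/div/a') inside it and issuing the requests myself, and that works fine.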

I also tried a workaround: using BaseSpider instead of CrawlSpider. It only makes the first request and never invokes the callback.

How can I fix this?

The order of the rules matters. According to the CrawlSpider documentation for the rules attribute:

If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
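The first-match behaviour quoted above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not Scrapy's implementation: the ToyRule tuple, the assign_links function and the two example links are hypothetical names made up for this illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (rule name, link predicate, callback name or None): a toy stand-in for Rule
ToyRule = Tuple[str, Callable[[str], bool], Optional[str]]


def assign_links(links: List[str], rules: List[ToyRule]) -> Dict[str, Optional[str]]:
    """Give each link to the first rule (in definition order) that matches it."""
    seen = set()
    assigned: Dict[str, Optional[str]] = {}
    for _name, matches, callback in rules:
        for link in links:
            # A link already claimed by an earlier rule is skipped.
            if link not in seen and matches(link):
                seen.add(link)
                assigned[link] = callback
    return assigned


# Hypothetical links found on one page.
links = ["/category/phones", "/item/1"]

# A broad rule without a callback, and a narrower rule with one.
broad = ("categories", lambda url: True, None)
items = ("items", lambda url: url.startswith("/item/"), "parse_item")

# Broad rule first: the item link is claimed by it and gets no callback.
print(assign_links(links, [broad, items])["/item/1"])  # None
# Item rule first: the item link reaches parse_item.
print(assign_links(links, [items, broad])["/item/1"])  # parse_item
```

In this model, swapping the two rules is enough to change which callback the item link receives, which is the effect described in the quoted documentation.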

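If I follow the first link on the start page, the items you want to follow are inside the structure below. Note that each item <div> has "fl" in its class and is a descendant of body-column-main, so the first rule's XPath matches these links too. Because that rule comes first and has no callback, parse_item is never called for them.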

<div id="body-column-main">
    <div id="box-ads-souq-1340" class="box-container ">...
    <div id="box-results" class="box-container box-container-none ">
        <div class="box box-style-none box-padding-none">
            <div class="bord_b_dash overhidden hidden-phone">
            <div class="item-all-controls-wrapper">
            <div id="ItemResultList">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                <div class="single-item-browse fl width-175 height-310 position-relative">
                ...

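Besides putting the item rule before the categories rule, another option is to change the first rule, the one for category pages, to a more restrictive XPath that does not match anything on the individual category pages, for example:
'//div[@id="body-column-main"]//div[contains(@class,"fl")]//ul[@class="refinementBrowser-mainList"]'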

You can also define a regular expression for the category pages and pass it to the allow parameter in your rules.
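To see how a regular expression meant for allow behaves before putting it in a rule, it can be tried against sample URLs with the standard re module. The sketch below assumes that link extractors look for the pattern anywhere in the URL (search semantics), which is my understanding of Scrapy's behaviour. The thread does not show what category URLs look like, so the example reuses the pagination pattern from the question; the allowed helper and the paged URL are hypothetical.

```python
import re
from typing import Iterable


def allowed(url: str, patterns: Iterable[str]) -> bool:
    """Return True if any pattern is found anywhere in the URL (search, not full match)."""
    return any(re.search(p, url) for p in patterns)


start_url = "http://uae.souq.com/ae-en/shop-all-categories/c/"  # from the question
paged_url = "http://uae.souq.com/ae-en/some-category/l/?page=2"  # hypothetical URL

# With search semantics the leading '.*?' makes no difference to the result.
for pattern in (r'.*?page=\d+', r'page=\d+'):
    print(pattern, allowed(start_url, [pattern]), allowed(paged_url, [pattern]))
```

Both patterns reject the start URL and accept the paged one, so under that assumption the shorter page=\d+ would be enough in a rule.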
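It works, but do you know why my rules were wrong?

Your rules were not wrong as such. I haven't looked at the details of your page, but if a link matches both rules, the order of the rules matters. See "If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute."

@Vandel, I looked further and added an explanation of why both rules match the same links on the category pages.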

name="souq_com"
allowed_domains=['uae.souq.com']
start_urls=["http://uae.souq.com/ae-en/shop-all-categories/c/"]

rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="ItemResultList"]/div/div/div')), callback='parse_item'),
    #categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="body-column-main"]//div[contains(@class,"fl")]'))),
    Rule(SgmlLinkExtractor(allow=(r'.*?page=\d+'))),
)