Python: how to use the Rule class in Scrapy


I am trying to use the Rule class to go to the next page in my crawler. Here is my code:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import GDReview


class GdSpider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
    ]

    rules = (

        # Extract next links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow= True)
    )


    def parse(self, response):
        company_name = response.xpath('//*[@id="EIHdrModule"]/div[3]/div[2]/p/text()').extract()

        '''loop over every review in this page'''
        for sel in response.xpath('//*[@id="EmployerReviews"]/ol/li'):
            review = Item()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1] #sel.xpath('@id/text()').extract()
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()

            yield review
My question is about the rules part. The links extracted by this rule do not contain the domain name. For example, it returns something like "/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm".

How can I make sure that my crawler appends the domain to the returned links?


Thanks!

You can be sure it will, since appending the domain is the default behavior of link extractors in Scrapy.
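For intuition, the resolution a link extractor performs is equivalent to joining the relative href against the current page URL, as the standard library's urljoin does (a minimal stdlib sketch of the idea, not Scrapy's actual code path; the P2 URL below is an assumed example):

```python
from urllib.parse import urljoin

# The page the spider is currently on.
page_url = "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"

# A relative href as it would appear in the page source (assumed example).
relative_href = "/Reviews/Johnson-and-Johnson-Reviews-E364_P2.htm"

# Link extractors hand back absolute URLs, equivalent to this join.
absolute = urljoin(page_url, relative_href)
print(absolute)
# -> http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P2.htm
```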

Also, the restrict_xpaths argument should not point to the @href attribute, but to the a element itself, or to a container that has a elements as descendants. Plus, restrict_xpaths can be defined as a plain string.
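The element-versus-attribute distinction can be illustrated with the standard library's ElementTree, which supports a small XPath subset (a sketch over simplified, made-up markup; the real Glassdoor HTML differs):

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for the pagination markup
# (illustrative only; hrefs here are invented).
snippet = """
<ul>
  <li class="prev"><a href="/Reviews/E364_P1.htm">Prev</a></li>
  <li class="next"><a href="/Reviews/E364_P3.htm">Next</a></li>
</ul>
"""

root = ET.fromstring(snippet)

# Select the <a> *element* (what restrict_xpaths should target)...
link = root.find(".//li[@class='next']/a")

# ...and read its href attribute afterwards, which is what the
# link extractor does for you once it has the element.
print(link.get("href"))
# -> /Reviews/E364_P3.htm
```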

In other words, replace:

restrict_xpaths=('//li[@class="next"]/a/@href',)

with:

restrict_xpaths='//li[@class="next"]/a'
Aside from that, you need to switch from SgmlLinkExtractor to LxmlLinkExtractor, as the Scrapy docs note:

    SGMLParser-based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.

Personally, I usually use the LinkExtractor shortcut to LxmlLinkExtractor:

from scrapy.contrib.linkextractors import LinkExtractor

To sum it up, this is what I would have in rules:

rules = [
    Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)
]
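For intuition, what that rule with follow=True does, extract the "next" link, resolve it to an absolute URL, and fetch the next page until there is none, can be approximated with the standard library alone (a sketch over fake in-memory pages; the URLs and the crawl function are hypothetical, not Scrapy APIs):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Fake site: URL -> well-formed markup with an optional "next" link.
PAGES = {
    "http://example.com/p1": "<ul><li class='next'><a href='/p2'>Next</a></li></ul>",
    "http://example.com/p2": "<ul><li class='next'><a href='/p3'>Next</a></li></ul>",
    "http://example.com/p3": "<ul></ul>",  # last page: no next link
}

def crawl(start_url):
    """Follow 'next' links the way Rule(..., follow=True) would."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        root = ET.fromstring(PAGES[url])
        link = root.find(".//li[@class='next']/a")
        # Resolve the relative href against the current page URL,
        # mirroring the link extractor's default behavior.
        url = urljoin(url, link.get("href")) if link is not None else None
    return visited

print(crawl("http://example.com/p1"))
# -> ['http://example.com/p1', 'http://example.com/p2', 'http://example.com/p3']
```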
What do you mean by "plus, restrict_xpaths can be defined as a string"? They have to be valid XPaths, right?