Python: How to set Scrapy rules dynamically?


I have a class that runs some code before initialization:

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)),
             callback="parse_items", follow=True),
    )

    def __init__(self, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams
I run this Scrapy code with the following command:

> scrapy runspider my_spider.py -a moreparams="more parameters" -o output.txt 
Now I would like the static variable named rules to be configurable from the command line:

> scrapy runspider my_spider.py -a crawl_pages=True -a moreparams="more parameters" -o output.txt
with __init__ changed to:

def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
    if (crawl_pages is True):
        self.rules = (
            Rule(SgmlLinkExtractor(allow=("",)),
                 callback="parse_items", follow=True),
        )
    self.moreparams = moreparams
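However, if I switch the static variable rules inside __init__, Scrapy no longer takes it into account: the spider runs, but it only crawls the given start_urls rather than the whole domain. It seems that rules has to be a static class variable.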


So, how can I set a static variable dynamically?

Well, you have two options. The simpler one (I am not sure it will work) is to set the rules in the constructor on the class rather than on self:

def __init__(self, session_id=-1, crawl_pages=False, allowed_domains=None,
             start_urls=None, xpath=None, contains=None, doesnotcontain=None,
             *args, **kwargs):

    # You simply set the class member from here
    NoFollowSpider.rules = (
        Rule(SgmlLinkExtractor(allow=("",)),
             callback="parse_items", follow=True),
    )
I am not sure whether Scrapy will respect this; it depends on when it reads the rules. But it is worth a try.

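The other, more complex approach is to use a metaclass. Basically, you can intervene in how the class itself is created, not just its instances. Note that the metaclass's __new__ method runs at import time, before any of your other code runs.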

class MyType(type):
    """
    A Meta class that creates classes 
    """
    # __new__ is implicitly a static method, so no decorator is needed
    def __new__(cls, name, bases, namespace):
        ret = type.__new__(cls, name, bases, namespace)

        # whatever you want to do - do it here. You can peek into
        # the command line args for example
        ret.rules = (....)
        return ret


class MyClass(object):
    """
    Now comes the actual class, with the __metaclass__ identifier.
    This means that when we create the class definition we call the metaclass' __new__
    """ 
    __metaclass__ = MyType

    def __init__(self):
        pass
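
The answer's example uses the Python 2 `__metaclass__` attribute. As a rough sketch of the same idea in Python 3 syntax, the block below lets a metaclass fill in a class attribute at class-creation time. The names RulesMeta, choose_rules and the "--crawl" flag are hypothetical, and the rules are plain placeholder strings rather than real Scrapy Rule objects.

```python
import sys


def choose_rules(argv):
    """Hypothetical policy: pick placeholder rules from a command-line flag."""
    return ("follow-all",) if "--crawl" in argv else ()


class RulesMeta(type):
    """Metaclass that sets `rules` on every class it creates."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Runs once, when the class statement below is executed (import time).
        cls.rules = choose_rules(sys.argv)
        return cls


class MySpider(metaclass=RulesMeta):
    """The class body is unchanged; the metaclass has already set `rules`."""


spider = MySpider()
print(MySpider.rules, spider.rules is MySpider.rules)
```

Because the metaclass runs when the class is defined, it only sees information that is available at import time, which is why this approach reads the process arguments directly instead of the spider's constructor arguments.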
Define it before you define your rules.

How do I set a static variable dynamically?

I don't know Scrapy, but is there any reason you can't just use a class method?

class NoFollowSpider(CrawlSpider):
    rules = (
        Rule(SgmlLinkExtractor(allow=("",)),
             callback="parse_items", follow=True),
    )

    @classmethod
    def set_rules(klass, rules):
        # Rebinds the attribute on the class, so every instance sees it
        klass.rules = rules
Note that rules is not a static variable but a class attribute.
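
To illustrate the class-attribute point, here is a small sketch that does not depend on Scrapy. The Spider class and the placeholder rule strings are made up for the example; the behaviour shown is ordinary Python attribute lookup.

```python
class Spider(object):
    # Class attribute: shared by all instances until one of them shadows it.
    rules = ("default",)

    @classmethod
    def set_rules(cls, rules):
        cls.rules = tuple(rules)


first = Spider()
Spider.set_rules(["follow-everything"])
second = Spider()

# Both instances resolve `rules` through the class, even the older one.
shared = (first.rules, second.rules)

# Assigning through an instance creates an instance attribute that
# shadows the class attribute for that one object only.
first.rules = ("only-mine",)

print(shared)
print(first.rules, second.rules, Spider.rules)
```

This is also why `self.rules = ...` and `NoFollowSpider.rules = ...` are not interchangeable: the first affects one instance, the second affects the class and every instance that has not shadowed the attribute.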


Edit: here is another way that may let you set it right at the start. It should let you avoid calling _compile_rules(), and I think it is cleaner:

class NoFollowSpider(CrawlSpider):
    def __new__(klass, crawl_pages=False, moreparams=None, *args, **kwargs):
        if crawl_pages:
            klass.rules = ( Rule (SgmlLinkExtractor(allow=("", ),),\
            callback="parse_items",  follow= True),)
        return super(NoFollowSpider,klass).__new__(klass,*args,**kwargs)
    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        self.moreparams = moreparams

Here is how I solved the problem, with a lot of help from @Not_a_Golfer and @nramirezuy. I simply combined the two approaches they suggested:

class NoFollowSpider(CrawlSpider):

    def __init__(self, crawl_pages=False, moreparams=None, *args, **kwargs):
        super(NoFollowSpider, self).__init__(*args, **kwargs)
        # Set the class member from here
        if (crawl_pages is True):
            NoFollowSpider.rules = (
                Rule(SgmlLinkExtractor(allow=("",)),
                     callback="parse_items", follow=True),
            )
            # Then recompile the Rules
            super(NoFollowSpider, self)._compile_rules()

        # Keep going as before
        self.moreparams = moreparams

Thanks, everyone, for the help.
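
One detail to watch with a check like `crawl_pages is True`: values passed with `-a` on the Scrapy command line arrive as strings, so `-a crawl_pages=True` delivers the string "True", which is not the object True. A minimal sketch of a conversion helper follows; the name to_bool and the accepted spellings are assumptions for illustration, not part of Scrapy.

```python
def to_bool(value):
    """Interpret a command-line style value as a boolean.

    Accepts real booleans, None, and a few common string spellings.
    """
    if isinstance(value, bool):
        return value
    if value is None:
        return False
    return str(value).strip().lower() in ("1", "true", "yes", "on")


arg = "True"                  # what a spider would receive from -a crawl_pages=True
identity_check = arg is True  # an identity test against True fails for a string
converted = to_bool(arg)

print(identity_check, converted)
```

Inside the spider's `__init__`, the condition could then be written as `if to_bool(crawl_pages):`, which behaves the same whether the value comes from the command line or from Python code.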

I am doing this with Scrapy 1.0 and it works. Note that you can only trust kwargs on the initial Spider instantiation.

    class LinuxFoundationSpider(CrawlSpider):
        year = None

        def __init__(self, category=None, *args, **kwargs):
            # Default pattern: follow every monthly "date.html" index
            monthly_thread_xpath = r'date\.html'
            if kwargs.get('year'):
                LinuxFoundationSpider.year = kwargs['year']
            if LinuxFoundationSpider.year:
                monthly_thread_xpath = '%s.*?(\\/date\\.html)' % LinuxFoundationSpider.year

            # Set the rules on the class before the parent __init__ compiles them
            LinuxFoundationSpider.rules = (
                Rule(LinkExtractor(allow=(monthly_thread_xpath,))),
                Rule(LinkExtractor(restrict_xpaths=('//ul[2]/li/a[1]',)),
                     callback='parse_entry', follow=False),
            )
            super(LinuxFoundationSpider, self).__init__(*args, **kwargs)
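
The year-dependent pattern in the answer above can be tried out on its own. The sketch below rebuilds it in a helper and checks it with re.search as a stand-in for how an allow pattern is applied to a URL; the helper names and the example URLs are hypothetical, and the pattern is slightly simplified (no capture group).

```python
import re


def monthly_thread_pattern(year=None):
    """Build the allow-pattern: all monthly indexes, or only those of one year."""
    if year:
        return r'%s.*?/date\.html' % year
    return r'date\.html'


def is_allowed(url, year=None):
    """Return True if the URL matches the pattern for the given year."""
    return re.search(monthly_thread_pattern(year), url) is not None


# Hypothetical mailing-list archive URLs, used only for the demonstration.
url_2015 = "https://lists.example.org/pipermail/dev/2015-June/date.html"
url_2014 = "https://lists.example.org/pipermail/dev/2014-June/date.html"

print(is_allowed(url_2015), is_allowed(url_2014))
print(is_allowed(url_2015, "2015"), is_allowed(url_2014, "2015"))
```

Building the pattern in a small function keeps the rule construction testable without starting a crawl.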

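You can do this with a metaclass, which is responsible for creating the class itself rather than its instances. Is that the direction you want to go in?

Not sure what you are asking. Your code does not run before init; what you pasted is exactly the init function. Please elaborate.

@BartoszKP I clarified the question; let me know if it makes more sense now.

@Not_a_Golfer Indeed, I think I need to go in that direction. I would appreciate any pointers or resources.

@antoinet Yes, it looks much better, thanks!

The first option does not work because, as stated in the other answer, the rules are compiled in the parent class at initialization. But you helped me a lot in finding the right answer! Thank you!

Thanks for your answer, it helped me find the solution to my problem!

That is because, as @nramirezuy explained in another answer, the rules are compiled by Scrapy at startup.

Sorry, what I gave you was just an embarrassing mistake :-( If I don't deserve it, I don't want it! Please see my edited answer. I believe this is how you should do it.

Very interesting, and yes, cleaner. But it raises TypeError: __new__() takes at least 1 argument (0 given). I don't understand why, since __new__ is called with *args and **kwargs...

I think the trick here is that, at the end of each method call, you are actually calling super on the parent. Doing it that way gives us the chance to set the variables beforehand.

I did this and it worked as expected. Thanks. Though I regret choosing Python and Scrapy; it was awful. I wasted two hours on this. It could have been done easily in Haskell.
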
class NoFollowSpider(CrawlSpider):
    def __init__(self, crawl_pages=False, moreparams=None, *a, **kw):
        if (crawl_pages is True):
            NoFollowSpider.rules = ( Rule (SgmlLinkExtractor(allow=("", ),),
                                           callback="parse_items",  follow= True),)

        # No need to call "_compile_rules()" manually, it's called in __init__ of the parent
        super(NoFollowSpider, self).__init__(*a, **kw)

        # Keep going as before
        self.moreparams = moreparams
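
Why the order matters can be shown without Scrapy. The sketch below uses MiniCrawlSpider, a hypothetical stand-in that mimics only the behaviour described in this thread (the parent `__init__` compiles the rules by taking a snapshot of them); it is not Scrapy's actual implementation, and the rules are placeholder strings.

```python
class MiniCrawlSpider(object):
    """Hypothetical stand-in for a base class that compiles rules in __init__."""

    rules = ()

    def __init__(self):
        self._compile_rules()

    def _compile_rules(self):
        # Snapshot of whatever `rules` resolves to at this moment.
        self._rules = list(self.rules)


class SetBefore(MiniCrawlSpider):
    def __init__(self, crawl_pages=False):
        if crawl_pages:
            SetBefore.rules = ("follow-all",)
        # The parent compiles the rules that were just set.
        super(SetBefore, self).__init__()


class SetAfter(MiniCrawlSpider):
    def __init__(self, crawl_pages=False):
        super(SetAfter, self).__init__()
        if crawl_pages:
            # Too late: the snapshot in _rules was taken by the parent already.
            self.rules = ("follow-all",)


before = SetBefore(crawl_pages=True)
after = SetAfter(crawl_pages=True)

print(before._rules)  # ['follow-all']
print(after._rules)   # []
```

Setting the rules on the class has a side effect worth keeping in mind: the new value stays on the class for every later instance in the same process.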