Python: passing an argument to allowed_domains in Scrapy


I am creating a crawler that takes a user's input and crawls all of the links on a site. However, I need to limit the crawling and link extraction to links from that domain only, with no external domains. I have got it to the point I need it. My problem is that I can't seem to pass the scrapy option into allowed_domains through the command. Here is the first script that gets run:

# First Script
import os

def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    os.system("scrapy runspider -a user_input='http://" + user_input + "' crawler_prod.py")

userInput()
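
As a side note, here is a hedged sketch (not part of the original question) of the same launcher using subprocess with an explicit argument list instead of os.system, so the URL the user types is not re-parsed by the shell; it keeps the question's Python 2 raw_input and assumes the spider file is still named crawler_prod.py:

# Alternative launcher sketch (assumptions: Python 2.7, scrapy available on PATH)
import subprocess

def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    # Passing the command as a list avoids shell quoting issues entirely.
    subprocess.call([
        "scrapy", "runspider",
        "-a", "user_input=http://" + user_input,
        "crawler_prod.py",
    ])

userInput()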
The script it runs is the crawler, which crawls the given domain. Here is the crawler code:

#Crawler
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import Request
from scrapy.http import Request

class InputSpider(CrawlSpider):
        name = "Input"
        #allowed_domains = ["example.com"]

        def allowed_domains(self):
            self.allowed_domains = user_input

        def start_requests(self):
            yield Request(url=self.user_input)

        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            x = HtmlXPathSelector(response)
            filename = "output.txt"
            open(filename, 'ab').write(response.url + "\n")
I have tried sending the request through the terminal command, but that crashes the crawler. The way I have it now also crashes the crawler. I have also tried just putting in allowed_domains = [user_input], and it reports that it is not defined. I have been playing with Scrapy's Request library to make this work, with no luck. Is there a better way to keep the crawl from leaving the given domain?
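
For reference, Scrapy forwards each -a name=value pair to the spider's __init__ as a keyword argument, where it can be stored before the crawl starts. A minimal sketch of that mechanism, assuming Scrapy 1.x; the spider name arg_demo is invented for illustration:

# Sketch only: shows where an -a argument lands, not a full crawler.
from scrapy.spiders import Spider

class ArgDemoSpider(Spider):
    name = "arg_demo"  # hypothetical name

    def __init__(self, user_input=None, *args, **kwargs):
        super(ArgDemoSpider, self).__init__(*args, **kwargs)
        # `scrapy runspider -a user_input=http://example.com spider.py`
        # arrives here as the user_input keyword argument.
        self.start_urls = [user_input] if user_input else []

    def parse(self, response):
        self.logger.info("visited %s", response.url)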

Edit:

Here is my new code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spiders import BaseSpider
from scrapy import Request
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse
#from run_first import *

class InputSpider(CrawlSpider):
        name = "Input"
        #allowed_domains = ["example.com"]

        #def allowed_domains(self):
            #self.allowed_domains = user_input

        #def start_requests(self):
            #yield Request(url=self.user_input)

        def __init__(self, *args, **kwargs):
            inputs = kwargs.get('urls', '').split(',') or []
            self.allowed_domains = [urlparse(d).netloc for d in inputs]
            # self.start_urls = [urlparse(c).netloc for c in inputs] # For start_urls

        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            x = HtmlXPathSelector(response)
            filename = "output.txt"
            open(filename, 'ab').write(response.url + "\n")
And here is the output log from the new code:

2017-04-18 18:18:01 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2017-04-18 18:18:01 [scrapy] INFO: Optional features available: ssl, http11, boto
2017-04-18 18:18:01 [scrapy] INFO: Overridden settings: {'LOG_FILE': 'output.log'}
2017-04-18 18:18:43 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2017-04-18 18:18:43 [scrapy] INFO: Optional features available: ssl, http11, boto
2017-04-18 18:18:43 [scrapy] INFO: Overridden settings: {'LOG_FILE': 'output.log'}
2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import CrawlSpider, Rule

2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
  from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

2017-04-18 18:18:43 [py.warnings] WARNING: /home/****-you/Python_Projects/Network-Multitool/crawler/crawler_prod.py:27: ScrapyDeprecationWarning: SgmlLinkExtractor is deprecated and will be removed in future releases. Please use scrapy.linkextractors.LinkExtractor
  Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')

2017-04-18 18:18:43 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2017-04-18 18:18:43 [boto] DEBUG: Retrieving credentials from metadata server.
2017-04-18 18:18:44 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2017-04-18 18:18:44 [boto] ERROR: Unable to read instance data, giving up
2017-04-18 18:18:44 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-04-18 18:18:44 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-04-18 18:18:44 [scrapy] INFO: Enabled item pipelines: 
2017-04-18 18:18:44 [scrapy] INFO: Spider opened
2017-04-18 18:18:44 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-18 18:18:44 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-18 18:18:44 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: 
2017-04-18 18:18:44 [scrapy] INFO: Closing spider (finished)
2017-04-18 18:18:44 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 18, 22, 18, 44, 794155),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 4, 18, 22, 18, 44, 790331)}
2017-04-18 18:18:44 [scrapy] INFO: Spider closed (finished)

You are missing a couple of things here:

  • The first request from start_urls is not filtered
  • You cannot override allowed_domains once the run has started

To fix these issues you need to write your own offsite middleware, or at least modify the existing one for your needs.

Once the spider is opened, the OffsiteMiddleware that handles allowed_domains converts the allowed_domains value into a regex string, and after that the parameter is never used again.
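
For context, this is roughly what the stock middleware does; the snippet below is a simplified paraphrase of Scrapy 1.x's scrapy/spidermiddlewares/offsite.py written from memory, not the exact source, so check the version you have installed:

    import re
    from scrapy.utils.httpobj import urlparse_cached

    class StockOffsiteMiddlewareSketch(object):
        # Simplified: signal wiring and logging omitted.

        def spider_opened(self, spider):
            # allowed_domains is read once here and baked into a regex;
            # changing the attribute later never reaches this regex.
            self.host_regex = self.get_host_regex(spider)

        def get_host_regex(self, spider):
            allowed_domains = getattr(spider, 'allowed_domains', None)
            if not allowed_domains:
                return re.compile('')  # empty pattern matches every host
            return re.compile(r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains))

        def should_follow(self, request, spider):
            host = urlparse_cached(request).hostname or ''
            return bool(self.host_regex.search(host))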

Add something like this to your middlewares.py:

    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
    from scrapy.utils.httpobj import urlparse_cached
    class MyOffsiteMiddleware(OffsiteMiddleware):
    
        def should_follow(self, request, spider):
            """Return bool whether to follow a request"""
            # hostname can be None for wrong urls (like javascript links)
            host = urlparse_cached(request).hostname or ''
            if host in spider.allowed_domains:
                return True
            return False
    
Then activate it in your settings.py:

    SPIDER_MIDDLEWARES = {
        # enable our middleware
        'myspider.middlewares.MyOffsiteMiddleware': 500,
        # disable old middleware
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None, 
    
    }
    
Now your spider should follow anything you have in allowed_domains, even if you modify it mid-run.

Edit: for your case:

    from scrapy.utils.httpobj import urlparse
    class MySpider(Spider):
        def __init__(self, *args, **kwargs):
            input = kwargs.get('urls', '').split(',') or []
            self.allowed_domains = [urlparse(d).netloc for d in input]
    
And now you can run:

    scrapy crawl myspider -a "urls=foo.com,bar.com"
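
Putting the pieces together for the spider in the question, here is a hedged sketch rather than the answer above verbatim: the single -a url argument, the super() call, and the LinkExtractor import are additions. It derives both start_urls and allowed_domains from one argument and keeps the scheme on the start URL (a missing scheme is what triggers the "Missing scheme in request url" error seen in the log above):

    from urlparse import urlparse  # Python 2; use urllib.parse.urlparse on Python 3

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class InputSpider(CrawlSpider):
        name = "Input"

        rules = [
            Rule(LinkExtractor(allow=()), follow=True, callback='parse_item'),
        ]

        def __init__(self, url=None, *args, **kwargs):
            # CrawlSpider.__init__ compiles the rules, so it has to run.
            super(InputSpider, self).__init__(*args, **kwargs)
            if url:
                self.start_urls = [url]                        # keeps http://
                self.allowed_domains = [urlparse(url).netloc]  # e.g. quotes.toscrape.com

        def parse_item(self, response):
            with open('output.txt', 'ab') as f:
                f.write(response.url + '\n')

This would then be run as something like scrapy runspider crawler_prod.py -a url="http://quotes.toscrape.com" (url is an assumed argument name here).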
    

I would suggest you look into that: it sets allowed_domains from the input argument. It is not a CrawlSpider, but you can probably use it as a base for your use case.

This looks like a good solution for adding the allowed_domains value before the scrapy command is called in the terminal. My problem is that I am setting this up so a user can enter a domain to be crawled, so I need to pass both the start URL and allowed_domains into the scrapy script. The way I pass the start URL right now is scrapy runspider -a user_input="http://quotes.toscrape.com" crawler.py. I am just having trouble restricting the allowed domains to the one the user entered, so the crawler cannot follow any external links.

@George you just need to change self.allowed_domains to contain the netloc of those URLs. See my edit.

I am playing with it now. I ran the command scrapy runspider crawler_prod.py -a url="http://quotes.toscrape.com,quotes.toscrape.com" with your edit added above. I am still experimenting with it, but I am getting URLError errors. Any suggestions?

@George could you post your crawl log? It is hard to say what that error means without context. To capture a log you can use scrapy crawl spider -s LOG_FILE=output.log or scrapy crawl spider &> output.log.

I actually managed to solve it. I will post my edit above, but you were right on the money. Thanks so much for your help!