Python: Can't change Scrapy spider settings

I'm trying to start a Scrapy crawler from a request that carries some parameters:

 msg_req_obj = MessageRequestObject(azureServiceBus=self.azure_service_bus,
                                        sbReqQ=self.sb_request_queue,
                                        sbResQ=self.sb_response_queue,
                                        session=message_body['session'],
                                        studyName=message_body['studyName'], 
                                        studyId=message_body['studyId'], 
                                        strategyId=message_body['strategyId'],
                                        requestId=message_body['requestId'],
                                        email=message_body['email'],
                                        crawlDepth=message_body['crawlDepth'],
                                        crawlPageCount=message_body['crawlPageCount'],
                                        sites=site_obj_array,
                                        msg=message)
This message essentially carries the first URL to start the spider from, plus two settings that change for every spider that gets created: crawlDepth and crawlPageCount.
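
For context, a message of this kind might look roughly like the sketch below; every value is invented for illustration, only the keys match the ones read in the snippet above:

# Hypothetical example of a message body; values are made up, keys are taken
# from the code above.
message_body = {
    "session": "abc123",
    "studyName": "demo-study",
    "studyId": 1,
    "strategyId": 7,
    "requestId": "req-0001",
    "email": "user@example.com",
    "crawlDepth": 3,        # becomes DEPTH_LIMIT
    "crawlPageCount": 50,   # becomes MAX_RESPONSES_TO_CRAWL
}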

I have the following ways of getting settings into the spider:

  • A settings.py file that contains the “default” settings for the spider
  • A config_settings.py file that adds settings to the whole project, including some meant to override settings.py, for example this getter for the depth limit:

    # NOTE: presumably exposed as a @property in the real file, since custom_settings
    # reads configurationSettings.depth_limit without calling it.
    def depth_limit(self):
        _default_depth_limit = 4
        if (self._depth_limit):
            try:
                return int(self._depth_limit)
            except Exception as ex:
                logger.error('"DEPTH_LIMIT" is not a number in application settings. Using default value "' + str(_default_depth_limit) + '"')
                return _default_depth_limit
        else:
            print('"DEPTH_LIMIT" not found/empty in application settings. Using default value "' + str(_default_depth_limit) + '"')
            return _default_depth_limit
    
  • custom_settings in the spider, which overrides the settings.py file with the values from config_settings.py

  • Getting the settings via get_project_settings() to retrieve the defaults, then updating them with the settings.update() method, passing in the values from msg_req_obj, and starting the crawler with these updated settings
The last one effectively changes the settings that get passed to the CrawlerRunner. But the spider doesn't load those settings and instead starts with DEPTH_LIMIT = 1.
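
For what it's worth, one way to check which value actually reaches the spider (a minimal diagnostic sketch, not code from the project) is to log the effective setting and its priority in from_crawler, since crawler.settings there is the final merged settings object:

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    # Diagnostic only: crawler.settings is the merged result of all the layers above,
    # so this shows the value the spider will really run with and where it came from.
    logger.info("Effective DEPTH_LIMIT: %s (priority %s)",
                crawler.settings.getint("DEPTH_LIMIT"),
                crawler.settings.getpriority("DEPTH_LIMIT"))
    return cls(crawler, *args, **kwargs)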

I tried hardcoding different values for the depth limit in all the other places (settings.py, config_settings.py, custom_settings), but none of them seems to work: the spider always crawls items down to depth 1 before stopping and closing. So it looks like the spider isn't picking up any of these settings and "defaults" to DEPTH_LIMIT = 1.

What am I missing? Are there other steps I should take to make this work?
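
For reference, Scrapy resolves conflicting settings sources through priorities (this is documented Scrapy behaviour, not something specific to this project): custom_settings is applied at 'spider' priority (30), while Settings.update() without an explicit priority writes at 'project' priority (20), so a class-level custom_settings value wins over a runner-level update. A minimal, self-contained sketch of that mechanism:

from scrapy.settings import Settings

s = Settings()
s.update({"DEPTH_LIMIT": 5})                      # default priority: 'project' (20)
s.update({"DEPTH_LIMIT": 1}, priority="spider")   # what custom_settings does (30)

print(s.getint("DEPTH_LIMIT"))       # -> 1, the higher-priority value wins
print(s.getpriority("DEPTH_LIMIT"))  # -> 30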

EDIT:

Here is the code of my CrawlProcess class:

class CrawlProcess(object):
    """description of class"""

    def __init__(self, blob_service, blob_service_output_container_name):
        """
        Constructor
        """        
        self.blob_service = blob_service
        self.blob_service_output_container_name = blob_service_output_container_name
        settings_file_path = 'scrapy_app.scrapy_app.settings' # The path seen from root, ie. from crawlProcess.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.runner = ''

    def spider_closing(self, spider):
        """Activates on spider closed signal"""
        print("STOPPING SPIDER")
        self.runner.join()

    def crawl_sites(self, blob_config, site_urls, msg_req_obj):
        print("SPIDER STARTED")
        print(site_urls)
        s = get_project_settings()
        s.update({
            "DEPTH_LIMIT" : msg_req_obj.crawlDepth,
            "MAX_RESPONSES_TO_CRAWL" : msg_req_obj.crawlPageCount,
        })        
        self.runner = CrawlerRunner(s)

        self.runner.crawl(GenericSpider, 
                    blobConfig=blob_config, 
                    msgReqObj=msg_req_obj,
                    urls=site_urls)

        deferred = self.runner.join()
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()

    def start_process(self, site_urls, msg_req_obj):
        blob_config = BlobConfig(blob_service=self.blob_service, blob_container_name=self.blob_service_output_container_name,)

        crawl_sites_process = mp.Process(target=self.crawl_sites, args=(blob_config, site_urls, msg_req_obj), daemon=True)

        print("STARTING SPIDER")
        crawl_sites_process.start()
        crawl_sites_process.join()
        print("SPIDER STOPPED")
        print("ENGINE STOPPED")   

And here is the code of my GenericSpider:

import scrapy
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse
try:
    from scrapy_app.scrapy_app.items import HtmlItem
except ImportError:
    from scrapy_app.items import HtmlItem

import re
import os
import json
from scrapy_splash.response import SplashJsonResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy_app.scrapy_app.utils import get_domain_from_url, get_subdomain_from_url
import logging
from config_settings import ConfigurationSettings
from scrapy_splash import SplashRequest

logger = logging.getLogger(__name__)

class GenericSpider(CrawlSpider):
    extractor = LinkExtractor()
    crawl_depth = 0
    name = 'generic'
    configurationSettings = ConfigurationSettings.getInstance()
    handle_httpstatus_list = configurationSettings.handle_http_statuses
    handle_httpstatus_all = configurationSettings.handle_all_http_statuses
    custom_settings = {
        'ROBOTSTXT_OBEY': configurationSettings.obey_robotstxt,
        'DEPTH_LIMIT': configurationSettings.depth_limit,
        'DOWNLOAD_DELAY': configurationSettings.download_delay_for_requests,
        'CLOSESPIDER_PAGECOUNT': configurationSettings.max_responses_to_crawl
    }


    logger.setLevel(logging.INFO)
    logging.basicConfig(
           filename='scraping.log',
           format='%(levelname)s: %(message)s',
           level=logging.INFO
       )

    def __init__(self, crawler, *args, **kwargs):
        self.crawler = crawler
        self.blobConfig = kwargs.get('blobConfig')
        self.msgReqObj = kwargs.get('msgReqObj')
        self.urls = kwargs.get('urls')
        self.allowed_domains = [urlparse(url).netloc for url in self.urls]
        self.start_urls = self.urls
        self.proxy_pool = self.configurationSettings.proxies

        self.suggestedKeywords = self.configurationSettings.suggested_keywords


        self.rules = [Rule(LinkExtractor(allow=(), allow_domains=self.allowed_domains,
                                         canonicalize=True, unique=True,), 
                           follow=True, callback="parse_item", process_request="use_splash_request"),]

        self._follow_links = True
        self._domain = ""

        self.failed_urls_dict = {}
        for httpstatus in self.handle_httpstatus_list:
            self.failed_urls_dict[httpstatus] = []


        super(GenericSpider, self).__init__(crawler, *args, **kwargs)


    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # settings = crawler.settings
        return cls(crawler, *args, **kwargs)

    def parse_item(self, response):

        # if self.handle_httpstatus_all or response.status not in self.handle_httpstatus_list:              # Without this line, ALL HTTP responses are handled
        item = self._get_item(response)
        yield item

    def _get_item(self, response):

        children = []
        # Get parameters from the Response
        _domain = response.meta['url_domain'] if 'url_domain' in response.meta else get_domain_from_url(response.url)
        _subdomain = get_subdomain_from_url(response.url, _domain)
        _comparableId = response.meta['comparableId'] if 'comparableId' in response.meta else 'NA'
        root = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(response.url))
        _html = response.text
        base_tag = response.css("head base").extract()
        if not base_tag:
            _html = _html.replace("</head>", "<base href=\"" + root + "/\"></head>")

        #Populate Child pages List
        links = self.extractor.extract_links(response)
        [children.append(link.url) for link in links]


        item = HtmlItem(
            url=response.url,
            domain=_domain,
            subdomain=_subdomain,
            html=_html,
            description='',
            title='',
            is_suggested=str(False),
            comparable_id=str(_comparableId),
            is_error=str(False) if 200 <= response.status < 300 else str(True),
            http_status=response.status,
            crawl_depth=response.meta['depth'],
            child_pages=children
        )

        self._set_title(item, response)
        self._set_description(item, response)
        self._is_suggested(item)

        return item

    def _set_title(self, item, response):
        if isinstance(response, SplashJsonResponse) or response.meta['isFirstPage'] == True:
            title = response.css("title::text").extract()
            if title:
                item['title'] = title[0].encode("utf-8")
        else: 
            pass

    def _set_description(self, item, response):
        if isinstance(response, SplashJsonResponse):
            meta_description = response.css("meta[name=\"description\"]::attr(content)").extract()
            if meta_description:
                item['description'] = meta_description[0].encode("utf-8")

    def _is_suggested(self, item):
        #logger.info('TITLE-DESCRIPTION:- %(title)s ==> %(desc)s', {'title': item['title'], 'desc': item['description']})
        _title = item['title'].decode("utf-8") if item['title'] else ''
        _description = item['description'].decode("utf-8") if item['description'] else ''
        try :
            if any(re.search(r'\b' + sug_kwd + r'\b', _title, re.IGNORECASE) for sug_kwd in self.suggestedKeywords) \
              or any(re.search(r'\b' + sug_kwd + r'\b', _description, re.IGNORECASE) for sug_kwd in self.suggestedKeywords):
                item['is_suggested'] = str(True)
        except Exception as ex:
            template = "GenericSpider:- An exception of type {0} occurred. Arguments:\n{1!r}"
            ex_message = template.format(type(ex).__name__, ex.args)
            print(ex_message)
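
One more detail worth checking, since custom_settings is a class attribute: expressions such as configurationSettings.depth_limit are evaluated once, when the class body runs. A quick, hypothetical sanity check (not part of the original code) is to confirm that they really evaluate to numbers at that point, i.e. that depth_limit and friends are exposed as properties rather than uncalled methods:

from config_settings import ConfigurationSettings

cfg = ConfigurationSettings.getInstance()
# If depth_limit is a plain method rather than a property, this prints a bound
# method object instead of a number, which would be worth fixing regardless.
print(repr(cfg.depth_limit), repr(cfg.max_responses_to_crawl))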

@Gallaecio Thanks, I edited the question and added the code for the spider and the crawl process.

My point is that you should create …