
Python: How to add a random user agent to a Scrapy spider when the spider is called from a script?


I want to add a random user agent to every request made by a spider that is invoked from another script. My implementation looks like this:

CoreSpider.py

import glob
import os
import re

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from boilerpipe.extract import Extractor  # assumed source of Extractor; the import was not shown in the original snippet

import ContentHandler_copy


class CoreSpider(scrapy.Spider):
    name = "final"

    def __init__(self):
        self.start_urls = self.read_url()
        self.rules = (
            Rule(
                LinkExtractor(
                    unique=True,
                ),
                callback='parse',
                follow=True
            ),
        )

    def read_url(self):
        # Collect seed URLs from every *.list file in the seed directory.
        urlList = []
        for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
            with open(filename, "r") as f:
                for line in f.readlines():
                    url = re.sub('\n', '', line)
                    if "http" not in url:
                        url = "http://" + url
                    # print(url)
                    urlList.append(url)

        return urlList

    def parse(self, response):
        print("URL is: ", response.url)
        print("User agent is : ", response.request.headers['User-Agent'])
        filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
        # Extract the main article text from the downloaded page.
        article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
        print("Article is :", article)
        if len(article.split("\n")) < 5:
            print("Skipping to next url : ", article.split("\n"))
        else:
            print("Continue parsing: ", article.split("\n"))
            ContentHandler_copy.ContentHandler_copy.start(article, response.url)
This works fine. Now I want each request to use a different, randomly chosen user agent. I have successfully used random user agents in a regular Scrapy project, but I cannot get them to work with this spider when it is called from another script.

My settings.py from the working Scrapy project:

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 320,
}

USER_AGENT_LIST = "tutorial/user-agent.txt"
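For reference, the random_useragent middleware used here (presumably the scrapy-random-useragent package) reads USER_AGENT_LIST as the path to a plain-text file with one user agent string per line. The entries below are illustrative placeholders only, not values from the original project:

# user-agent.txt (illustrative example)
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0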

How can I tell my CoreSpider.py to use this settings.py configuration programmatically?

Please have a look at the documentation. You can pass the settings as an argument to the CrawlerProcess constructor. Or, if you are using a Scrapy project and want to pick the settings up from settings.py, you can do it like this:

...
from scrapy.utils.project import get_project_settings    
process = CrawlerProcess(get_project_settings())
...
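Alternatively, when the spider is launched from a standalone script outside a Scrapy project, the first option above (passing the settings straight to the CrawlerProcess constructor) can be used. A minimal sketch that reuses the user-agent middleware configuration from the settings.py shown in the question:

from scrapy.crawler import CrawlerProcess

# Sketch: supply the settings as a dict instead of relying on settings.py
process = CrawlerProcess({
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'random_useragent.RandomUserAgentMiddleware': 320,
    },
    'USER_AGENT_LIST': 'tutorial/user-agent.txt',
})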
In crawl(), use the class name without parentheses: process.crawl(CoreSpider). Even if you pass an instance (CoreSpider()) as the argument, Scrapy will create its own instance of the spider anyway.
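Putting it together, a complete launcher script might look like the sketch below; it assumes the script is run from inside the Scrapy project (so get_project_settings() can locate settings.py via scrapy.cfg) and that CoreSpider is importable from CoreSpider.py:

# run_crawler.py - minimal sketch of starting CoreSpider from a script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from CoreSpider import CoreSpider

process = CrawlerProcess(get_project_settings())  # picks up DOWNLOADER_MIDDLEWARES and USER_AGENT_LIST
process.crawl(CoreSpider)  # pass the class itself, not an instance
process.start()            # blocks here until the crawl has finished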