How can I use multiprocessing in Python to crawl millions of URLs with Scrapy?
I have implemented three levels of recursion to generate a seed list of URLs and then scrape information from each URL. I want to use all the cores of my system to speed up the crawl. This is the crawler code I have implemented so far:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from CompanyInfoGrabber.Utility.utils import getAddress, getCompanyStatus, getDirectorDetail, getRegNumber


class CompanyInfoGrabberSpider(scrapy.Spider):
    name = 'CompanyDetail'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        counter = 0
        print("User Agent in parse() is : ", response.request.headers['User-Agent'])
        hxp = HtmlXPathSelector(response)
        URL_LIST = hxp.select('//sitemapindex/sitemap/loc/text()').extract()
        print("URL LIST: ", URL_LIST)
        for URL in URL_LIST[:2]:
            next_page = response.urljoin(URL)
            yield Request(next_page, self.parse_page)

    def parse_page(self, response):
        print("User Agent in parse_page is : ", response.request.headers['User-Agent'])
        hxp = HtmlXPathSelector(response)
        # create seed list of company-url
        COMPANY_URL_LIST = hxp.select('//urlset/url/loc/text()').extract()
        print("Company url: ", COMPANY_URL_LIST[:20])
        """
        Here I want to use multiprocessing like this
        pool = Pool(processes=8)
        pool.map(parse_company_detail, COMPANY_URL_LIST)
        """
        for company_url in COMPANY_URL_LIST[:5]:
            next_page = response.urljoin(company_url)
            yield Request(next_page, self.parse_company_detail)

    def parse_company_detail(self, response):
        COMPANY_DATA = dict()
        print("User Agent in parse_company_page() is : ", response.request.headers['User-Agent'])
        hxp = HtmlXPathSelector(response)
        _ABOUT_ = ''.join(hxp.xpath('normalize-space(//div[@class="panel-body"]/text())').extract())
        for node in hxp.xpath('//div[@class="panel-body"]//p'):
            _ABOUT_ += ''.join(node.xpath('string()').extract())
        COMPANY_DATA['About'] = _ABOUT_
        # Get company data.
        COMPANY_DATA = getDirectorDetail(COMPANY_DATA, hxp)
        print("Dictionary: ", COMPANY_DATA)
        return COMPANY_DATA
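How can I use multiprocessing to crawl the seed list of URLs?
Thanks in advance.
Update:
My question is not a duplicate. I have only one spider here.
Regards,
omprakash

I suggest using the threading module to run several threads concurrently. You need to modify the class so that its __init__ accepts a URL argument.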
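import threading

sites = ['URL1', 'URL2', 'URL3']


def create_instance(site):
    # Assumes CompanyInfoGrabberSpider.__init__ has been modified to accept a URL.
    CompanyInfoGrabberSpider(site)


for site in sites:
    # Create and start one thread per site; each thread gets its own URL,
    # so the shared list is never mutated while it is being iterated.
    threading.Thread(target=create_instance, args=(site,)).start()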
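
Another route, since Scrapy already issues many requests concurrently inside one process, is to run several independent Scrapy processes and give each one a slice of the seed list. The sketch below only shows the sharding and process-launching side. It assumes the spider has been changed to accept a `url_file` argument (a hypothetical name, passed with the real `scrapy crawl -a NAME=VALUE` option) and to yield one request per line of that file. The helper names `shard_urls`, `write_shards`, `build_crawl_command`, and `run_crawlers` are also made up for this illustration.

```python
import os
import subprocess


def shard_urls(urls, n_shards):
    """Split urls round-robin into n_shards lists (some may be empty)."""
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    return [urls[i::n_shards] for i in range(n_shards)]


def write_shards(urls, n_shards, out_dir):
    """Write each non-empty shard to out_dir/shard_<i>.txt and return the paths."""
    paths = []
    for i, shard in enumerate(shard_urls(urls, n_shards)):
        if not shard:
            continue
        path = os.path.join(out_dir, f"shard_{i}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("\n".join(shard) + "\n")
        paths.append(path)
    return paths


def build_crawl_command(shard_path, spider_name="CompanyDetail"):
    """Build the command line for one Scrapy process.

    `url_file` is a hypothetical spider argument: the spider itself must
    read the file and yield a Request for every URL in it.
    """
    return ["scrapy", "crawl", spider_name, "-a", f"url_file={shard_path}"]


def run_crawlers(urls, out_dir, n_procs=None):
    """Start one `scrapy crawl` process per shard and wait for all of them."""
    n_procs = n_procs or os.cpu_count() or 1
    procs = [
        subprocess.Popen(build_crawl_command(path))
        for path in write_shards(urls, n_procs, out_dir)
    ]
    # Return the exit code of every crawler process.
    return [proc.wait() for proc in procs]
```

Calling `run_crawlers(seed_urls, "shards")` from inside the Scrapy project directory (with the `shards` directory already created) would start one crawler per CPU core. Whether this actually speeds things up depends on where the bottleneck is: if the crawl is limited by the network or by the target site's rate limits rather than by CPU, raising Scrapy's concurrency settings in a single process may be enough.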
Possible duplicate.
@ClémentDenoix No, it is not a duplicate. I have only one spider here.