How to properly run a Scrapy spider from an external Python script and get its item output
So, I'm building a few scrapers, and now I'm trying to write a script that runs the corresponding spider with URLs collected from a database, but I can't find a way to do it. I have this in my spider:
class ElCorteIngles(scrapy.Spider):
    name = 'ElCorteIngles'
    url = ''
    DEBUG = False

    def start_requests(self):
        if self.url != '':
            yield scrapy.Request(url=self.url, callback=self.parse)

    def parse(self, response):
        # Get product name
        try:
            self.p_name = response.xpath('//*[@id="product-info"]/h2[1]/a/text()').get()
        except:
            print(f'{CERROR} Problem while getting product name from website - {self.name}')
        # Get product price
        try:
            self.price_no_cent = response.xpath('//*[@id="price-container"]/div/span[2]/text()').get()
            self.cent = response.xpath('//*[@id="price-container"]/div/span[2]/span[1]/text()').get()
            self.currency = response.xpath('//*[@id="price-container"]/div/span[2]/span[2]/text()').get()
            if self.currency == None:
                self.currency = response.xpath('//*[@id="price-container"]/div/span[2]/span[1]/text()').get()
                self.cent = None
        except:
            print(f'{CERROR} Problem while getting product price from website - {self.name}')
        # Join self.price_no_cent with self.cent
        try:
            if self.cent != None:
                self.price = str(self.price_no_cent) + str(self.cent)
                self.price = self.price.replace(',', '.')
            else:
                self.price = self.price_no_cent
        except:
            print(f'{ERROR} Problem while joining price with cents - {self.name}')
        # Return data
        if self.DEBUG == True:
            print([self.p_name, self.price, self.currency])
        data_collected = ShopScrapersItems()
        data_collected['url'] = response.url
        data_collected['p_name'] = self.p_name
        data_collected['price'] = self.price
        data_collected['currency'] = self.currency
        yield data_collected
Normally, when I run the spider from the console, I do:
scrapy crawl ElCorteIngles -a url='https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/'
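Now I need a way to do the same thing from an external script and get back the data the spider yields.
At the moment my external script contains the following: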
import scrapy
from scrapy.crawler import CrawlerProcess
import sqlalchemy as db
# Import internal libraries
from Ruby.Ruby.spiders import *

# Variables
engine = db.create_engine('mysql+pymysql://DATABASE_INFO')


class Worker(object):
    def __init__(self):
        self.crawler = CrawlerProcess({})

    def scrape_new_links(self):
        conn = engine.connect()
        # Get all new links from DB and scrape them
        query = 'SELECT * FROM Ruby.New_links'
        result = conn.execute(query)
        for x in result:
            telegram_id = x[1]
            email = x[2]
            phone_number = x[3]
            url = x[4]
            spider = x[5]
            # In this case the spider will be ElCorteIngles and the url
            # https://www.elcorteingles.pt/electrodomesticos/A26601428-depiladora-braun-senso-smart-5-5500/
            self.crawler.crawl(spider, url=url)
        self.crawler.start()


Worker().scrape_new_links()
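I also don't know whether passing url=url to self.crawler.crawl() is the right way to hand the URL to the spider, so please tell me what you think.
All the data from yield is being returned through the pipeline.
I don't think any extra information is needed, but if you need anything, let me know.

The simplest way to do this is something like the following: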
class ElCorteIngles(scrapy.Spider):
    name = 'ElCorteIngles'
    url = ''
    DEBUG = False

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Establish your db connection here. This can be any database connection.
        # Reuse this connection object anywhere else.
        # `engine` is assumed to be the SQLAlchemy engine from the question's script;
        # raw_connection() returns the underlying DB-API connection, which has cursor().
        self.conn = engine.raw_connection()

    def start_requests(self):
        with self.conn.cursor() as cursor:
            cursor.execute('''SELECT url FROM Ruby.New_links WHERE url IS NOT NULL AND url != %s''', ('',))
            result = cursor.fetchall()
        # Each row is a one-element tuple holding the url column
        for (url,) in result:
            yield scrapy.Request(url=url, dont_filter=True, callback=self.parse)

    def parse(self, response):
        # Your Parse code here
        ...
Once that's done, you can start the crawler with something like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from project_name.spiders.filename import ElCorteIngles
process = CrawlerProcess(get_project_settings())
process.crawl(ElCorteIngles)
process.start()
Hope this helps.
If you are working with a large number of URLs, I'd also recommend using a queue. That lets multiple spider processes work through the URLs in parallel. You can set up the queue in the init method.
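The answer doesn't say which queue to use, so here is only a rough sketch of the idea. It assumes an in-process queue.Queue, with threads standing in for the separate spider processes; process_urls and handle are made-up names for illustration and are not part of Scrapy. With real spider processes, the queue would have to live somewhere all of them can reach.

```python
import queue
import threading


def process_urls(urls, handle, num_workers=4):
    """Fan urls out to num_workers threads and collect handle(url) results."""
    todo = queue.Queue()
    for url in urls:
        todo.put(url)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = handle(url)
            with lock:  # protect the shared results list
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Each URL is taken by exactly one worker, so nothing gets processed twice, but the order of the results is not guaranteed.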
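Scrapy works asynchronously. Ignore my imports, but this is a JSON API I built for Scrapy. You need to write a custom runner that listens to the item_scraped signal. Originally there was a Klein endpoint that returned a JSON list once the spider finished. I think this is what you want, just without the Klein endpoint, so I took that out. My spider was GshopSpider; I replaced it with your spider's name.
By taking advantage of Deferreds, we can use callbacks and have a signal fire each time an item is scraped. So with this code, a signal handler collects every item into a list, and when the spider finishes a callback is set up to call return_spider_output.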
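# server.py
import json

from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# Placeholder import path: point it at the module that defines your spider
from project_name.spiders.filename import ElCorteIngles


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []
        crawler = self.create_crawler(crawler_or_spidercls)
        crawler.signals.connect(self.item_scraped, signals.item_scraped)
        # _crawl is a private CrawlerRunner method and may change between Scrapy versions
        dfd = self._crawl(crawler, *args, **kwargs)
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


if __name__ == "__main__":
    settings = get_project_settings()
    runner = MyCrawlerRunner(settings)
    # Pass the spider class, not an instance; keyword arguments such as url=... reach the spider
    deferred = runner.crawl(ElCorteIngles)
    deferred.addCallback(return_spider_output)
    deferred.addCallback(print)
    # CrawlerRunner does not start the reactor for you. If your settings define
    # TWISTED_REACTOR, that reactor has to be installed before this import.
    from twisted.internet import reactor
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()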