
Python: single project vs. multiple projects

python web-scraping scrapy


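I am torn on how to house all of my spiders. These spiders will feed Apache NiFi through a command-line invocation, with items read from stdin. I also plan to have a subset of these spiders return single-item results using scrapyrt on a separate web server. I will need to create spiders across many different projects with different item models. They will all have similar settings (for example, using the same proxy).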

My question is: what is the best way to structure my Scrapy project(s)?

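1. Place all spiders in the same repository. This provides an easy way to create base classes for item loaders and item pipelines.

2. Group the spiders for each project I am working on into separate repositories. This has the advantage of letting the items be the focal point of each project, and no repository grows too large. The drawback is that common code, settings, spider monitors (spidermon), and base classes cannot be shared. This feels the cleanest, even though there is some duplication.

3. Package only the non-real-time spiders I plan to use in the NiFi repo, and the real-time spiders in another repo. The advantage is that I keep the spiders with the projects that actually use them, but it still centralizes/convolutes which spiders are used with which projects.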

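It feels like the correct answer is #2. Spiders related to a specific program should live in their own Scrapy project. When you create a web service for project A, you don't say "oh, I can just throw all of the service endpoints for project B into the same service, because that is where all of my services will reside", even though some settings may end up duplicated. Arguably, some of the shared code/classes could be shared through a separate package.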

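What do you think? How are you all structuring your Scrapy projects to maximize reusability? Where do you draw the line between the same project and separate projects? Is it based on your item model or on the data source?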
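The idea raised in the question of sharing common code through a separate package could be sketched roughly as follows. This is only an illustration under assumptions: the package name scraping_common, the helper build_settings, the key HTTP_PROXY_URL, and all values are hypothetical and do not come from the original post. DOWNLOAD_DELAY, ITEM_PIPELINES, and BOT_NAME are standard Scrapy setting names.

```python
import copy

# Hypothetical contents of scraping_common/settings.py, a shared installable package.
BASE_SETTINGS = {
    "DOWNLOAD_DELAY": 25,
    "HTTP_PROXY_URL": "http://proxy.example.com:8080",  # assumed custom key
    "ITEM_PIPELINES": {},
}


def build_settings(**overrides):
    """Return a deep copy of the shared defaults with per-project overrides applied."""
    settings = copy.deepcopy(BASE_SETTINGS)
    settings.update(overrides)
    return settings


# Hypothetical use inside one project's settings.py:
project_settings = build_settings(BOT_NAME="nifi_spiders", DOWNLOAD_DELAY=5)
# Expose the keys as module-level names, which is how a settings module is read.
globals().update(project_settings)
```

With this layout each repository keeps its own items and spiders, and only the overrides differ between projects.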

First of all, when I write a path like '/path', it is because I am an Ubuntu user. Adapt it if you are a Windows user. That is a matter of the file management system.

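A simple example

Let's imagine that you want to scrape two or more different websites. The first one is a swimsuit retail website. The second one is about the weather. You want to scrape both because you want to observe the link between swimsuit prices and the weather, in order to anticipate lower purchase prices.

Note that in pipelines.py I will use a mongo collection, because that is what I use and I don't need SQL for the moment. If you don't know mongo, consider a collection to be the equivalent of a table in a relational database. The Scrapy project could look like the following: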

spiderswebsites.py, where you can write as many spiders as you want:

    import scrapy
    from ..items import SwimItem, WeatherItem
    # If you sometimes have trouble importing from the parent directory, you can do:
    # import sys
    # sys.path.append('/path/parentDirectory')


    class SwimSpider(scrapy.Spider):
        name = "swimsuit"
        start_urls = ['https://www.swimsuit.com']

        def parse(self, response):
            price = response.xpath('//span[@class="price"]/text()').extract()
            model = response.xpath('//span[@class="model"]/text()').extract()
            ...  # and so on
            item = SwimItem()  # the class needs to be called -> ()
            item['price'] = price
            item['model'] = model
            ...  # and so on
            return item


    class WeatherSpider(scrapy.Spider):
        name = "weather"
        start_urls = ['https://www.weather.com']

        def parse(self, response):
            temperature = response.xpath('//span[@class="temp"]/text()').extract()
            cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract()
            ...  # and so on
            item = WeatherItem()  # the class needs to be called -> ()
            item['temperature'] = temperature
            item['cloud'] = cloud
            ...  # and so on
            return item

items.py, where you can write as many item schemas as you want:

    import scrapy


    class SwimItem(scrapy.Item):
        price = scrapy.Field()
        stock = scrapy.Field()
        ...
        model = scrapy.Field()


    class WeatherItem(scrapy.Item):
        temperature = scrapy.Field()
        cloud = scrapy.Field()
        ...
        pressure = scrapy.Field()

pipelines.py, where I use Mongo:

    import pymongo
    from .items import SwimItem, WeatherItem


    class ScrapePipeline(object):

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod  # this is a decorator, a powerful tool in Python
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGODB_URL'),
                mongo_db=crawler.settings.get('MONGODB_DB', 'default-test')
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Pick the collection from the type of the item being processed.
            if isinstance(item, SwimItem):
                collection_name = 'swimwebsite'
            elif isinstance(item, WeatherItem):
                collection_name = 'weatherwebsite'
            else:
                return item  # unknown item type: pass it through unsaved
            self.db[collection_name].insert_one(dict(item))
            return item
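As a variation on process_item, the item type could be mapped to its collection with a dictionary, so that supporting a new item type means adding one entry. The sketch below is an assumption-laden illustration: it uses plain dataclasses as stand-ins for the items and an in-memory dict in place of MongoDB so that it runs without Scrapy or a database, and RoutingPipeline, COLLECTIONS, and storage are hypothetical names.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class SwimItem:  # stand-in for the scrapy.Item of the same name
    price: str
    model: str


@dataclass
class WeatherItem:  # stand-in for the scrapy.Item of the same name
    temperature: int
    cloud: int


class RoutingPipeline:
    """Send each item to a collection chosen by the item's class."""

    COLLECTIONS = {SwimItem: "swimwebsite", WeatherItem: "weatherwebsite"}

    def __init__(self):
        # In-memory stand-in for the MongoDB database used in the answer.
        self.storage = defaultdict(list)

    def process_item(self, item, spider):
        collection_name = self.COLLECTIONS.get(type(item))
        if collection_name is None:
            raise ValueError(f"no collection configured for {type(item).__name__}")
        self.storage[collection_name].append(asdict(item))
        return item  # returning the item lets later pipelines see it


pipeline = RoutingPipeline()
pipeline.process_item(SwimItem(price="19.99", model="classic"), spider=None)
pipeline.process_item(WeatherItem(temperature=12, cloud=30), spider=None)
```

The spider argument is unused here; it is kept only so the method has the same shape as the pipeline above.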
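So, looking at the project in my example, you can see that the project does not depend on the item schema at all, because several types of items can be used in the same project. The advantage of the pattern above is that you keep the same configuration in settings.py if you need to. But don't forget that you can customize the command that runs your spider. If you want a spider to run slightly differently from the default settings, you can run it as

    scrapy crawl spider -s DOWNLOAD_DELAY=35

instead of the 25 written in settings.py, for instance.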
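To illustrate the override order, here is a toy model, not Scrapy's actual implementation: a value passed with -s on the command line wins over the value from settings.py. In Scrapy, a spider's custom_settings attribute sits between the two. The dictionaries and values below are made up for the illustration.

```python
from collections import ChainMap

project_settings = {"DOWNLOAD_DELAY": 25, "CONCURRENT_REQUESTS": 8}  # settings.py
spider_custom_settings = {"CONCURRENT_REQUESTS": 2}  # a spider's custom_settings
command_line = {"DOWNLOAD_DELAY": 35}  # from: -s DOWNLOAD_DELAY=35

# ChainMap looks keys up left to right, so earlier mappings take precedence.
effective = ChainMap(command_line, spider_custom_settings, project_settings)

delay = effective["DOWNLOAD_DELAY"]
concurrency = effective["CONCURRENT_REQUESTS"]
```

For a permanent per-spider difference, custom_settings is probably the better fit; the -s flag suits one-off runs.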
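Functional programming

Moreover, my example here is light. In reality, you are rarely interested in the raw data. You need a lot of processing, which means a lot of lines. To make your code more readable, you can put functions in modules. But be careful.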

functions.py, a custom module:

    import re


    def cloud_temp(response):  # for WeatherSpider
        """Return a tuple (cloud, temperature) of integers."""
        # .get() returns the first match as a str, e.g. " 12°C"
        temperature = response.xpath('//span[@class="temp"]/text()').get()
        # returns a str such as "30%"
        cloud = response.xpath('//span[@class="cloud_perc"]/text()').get()
        # processing: you want to record the values as integers
        temperature = int(re.search(r'[0-9]+', temperature).group())  # int, e.g. 12
        cloud = int(re.search(r'[0-9]+', cloud).group())  # int, e.g. 30
        return (cloud, temperature)
In spider.py, this gives:

    import scrapy
    from ..items import SwimItem, WeatherItem
    from functions import *  # the directory holding functions.py must be on sys.path
    ...


    class WeatherSpider(scrapy.Spider):
        name = "weather"
        start_urls = ['https://www.weather.com']

        def parse(self, response):
            cloud, temperature = cloud_temp(response)  # shorter than the previous version
            ...  # and so on
            item = WeatherItem()  # the class needs to be called -> ()
            item['temperature'] = temperature
            item['cloud'] = cloud
            ...  # and so on
            return item
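One way to make the cleaning step in cloud_temp easier to test is to move the regular-expression part into a helper that takes plain strings, so it can be checked without a response object. This is a minimal sketch; first_int is a hypothetical name, and the sample strings mirror the ones in the comments of functions.py.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    if text is None:
        return None
    match = re.search(r"[0-9]+", text)
    return int(match.group()) if match else None


temperature = first_int(" 12°C")
cloud = first_int("30%")
missing = first_int(None)  # e.g. when the selector matched nothing
```

Returning None instead of raising keeps one missing field from aborting the whole parse; whether that is desirable depends on how strict the pipeline should be.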
Besides, it is a considerable improvement for debugging. Let's say I want to start a scrapy shell session:

    $ scrapy shell https://www.weather.com
    ...
    # Check in sys.path whether the directory that contains functions.py is present.
    >>> import sys
    >>> sys.path  # returns a list of paths
    # If the directory is not present:
    >>> sys.path.insert(0, '/path/directory')
    # Now the module can be imported in this session and tested in the shell,
    # while I modify the file functions.py itself.
    >>> from functions import *
    >>> cloud_temp(response)  # check that it returns what I want
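This is more comfortable than copying and pasting a piece of code. And because Python is a great language for functional programming, you should take advantage of it. That is why I told you "More generally, any pattern is valid if you limit the number of lines, improve readability, and limit bugs too." The more readable the code is, the fewer bugs you get. The fewer lines you write (for example, by avoiding copying and pasting the same processing for different variables), the fewer bugs you get. That is because when you fix a function itself, you fix everything that depends on it.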
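One caveat when editing functions.py during an open shell session: an already imported module is not re-read automatically, so the edited version may need an explicit reload. The sketch below shows this with the standard library's importlib.reload on a throwaway module written to a temporary directory; functions_demo and clean are hypothetical names used only for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a first version of a throwaway module into a temporary directory.
module_dir = Path(tempfile.mkdtemp())
module_file = module_dir / "functions_demo.py"
module_file.write_text("def clean(text):\n    return text.strip()\n")

# Make the directory importable, as done with sys.path.insert in the session above.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import functions_demo

first = functions_demo.clean(" 30% ")

# Edit the file on disk, then reload the module to pick up the change.
module_file.write_text("def clean(text):\n    return text.strip().rstrip('%')\n")
importlib.reload(functions_demo)

second = functions_demo.clean(" 30% ")
```

Names pulled in earlier with a star import keep pointing at the old functions, so after a reload the star import has to be run again, or the functions called through the module name as done here.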

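Now, if you are not very comfortable with functional programming, I can understand that you would make several projects for different item schemas. Work with your current skills, improve them, and then improve your code over time.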
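Jakob, from a Google Groups thread, recommended:

Whether spiders should go into the same project is mainly determined by the type of data they scrape, not by where the data comes from.

Say you are scraping user profiles from all of your target sites. Then you may have an item pipeline that cleans and validates user avatars and exports them into your "avatars" database. It makes sense to put all spiders into the same project. After all, they all use the same pipelines, because the data always has the same shape no matter where it was scraped from. On the other hand, if you are scraping questions from Stack Overflow and user profiles from Wikipedia, and you validate/process/export all of these data types differently, put the spiders into separate projects.

In other words, if your spiders have common dependencies (e.g. they share item definitions/pipelines/middlewares),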