Python: odd behavior of a Scrapy crawler

python python-3.x web-scraping scrapy

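I have written a script with Python and Scrapy to parse different categories on craigslist. I noticed something strange while running it. The script runs perfectly and raises no complaints. The puzzle is this: if I leave items.py blank, as shown below, it makes no difference to the crawl. So my question is, what does it do in my Scrapy project? Thanks in advance.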
The "items.py" file contains:
import scrapy

class CraigItem(scrapy.Item):
    pass

The spider contains:

import scrapy 
from scrapy import Request

class JobsSpider(scrapy.Spider):

    name = "category"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://newyork.craigslist.org/search/egr"]

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]   
            yield Request(absolute_url, callback=self.parse_page, meta={'URL': absolute_url, 'Title': title, 'Address':address})

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        # The last results page has no "next" link, so only follow it when one exists.
        if relative_next_url:
            yield Request(response.urljoin(relative_next_url), callback=self.parse)

    def parse_page(self, response):
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        address = response.meta.get('Address')
        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()
        yield{'URL': url, 'Title': title, 'Address':address, 'Compensation':compensation, 'Employment_Type':employment_type}

My question is: does the items.py file play no role at all in the crawling process? If it does, how?
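You should first read up on Scrapy Items. In short, Scrapy Items are dict-like classes used to define the items a spider produces. When you yield an item from a spider, it must be either a Scrapy Item or a dict (or, alternatively, a Request object). In your spider you chose the second approach and yield plain dicts.

The file items.py is a template generated by the scrapy startproject command. It defines a blank Item class so that you can extend it when you need to. Since your spider does not use that class, Scrapy does not use it either.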
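Thanks for your answer, Tomáš Linhart. In fact, I kicked items.py out of the Scrapy project and ran the spider again, and it still works. That means Scrapy is not using it.

@Top Sure, it is just an (almost) blank module that is not imported anywhere.

That is exactly what I wanted to be sure of. Thanks a lot, Tomáš Linhart. You made my day.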
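
The point about the module not being imported can be checked with plain Python, without Scrapy. The sketch below is an illustration under assumptions: demo_items.py and demo_spider.py are hypothetical stand-ins for the project's files, written to a temporary directory. It shows that a module nobody imports is never loaded.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project_dir:
    # Hypothetical stand-ins for items.py and the spider module.
    Path(project_dir, "demo_items.py").write_text("LOADED = True\n")
    Path(project_dir, "demo_spider.py").write_text("NAME = 'category'\n")

    sys.path.insert(0, project_dir)
    try:
        # Import only the spider module, as a crawl that never references items would.
        spider = importlib.import_module("demo_spider")
        # A module that was never imported does not appear in sys.modules.
        items_loaded = "demo_items" in sys.modules
    finally:
        sys.path.remove(project_dir)

print(spider.NAME)    # category
print(items_loaded)   # False
```

This matches the observation above: deleting a file that nothing imports cannot change how the spider runs.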