Unable to scrape multiple domains sequentially - Python Scrapy

python csv scrapy web-crawler

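I'm fairly new to both Python and web scraping. My first project scrapes a handful of Craigslist cities (5 in total) under the transportation subdomain. So far I have had to run the script manually for each city, after manually updating that city's domain in the constants (start_urls= and absolute_next_url=) in the script. Is there a way to adjust the script so that it runs sequentially through the cities I define (e.g. Miami, New York, Houston, Chicago, etc.) and automatically fills in the constants (start_urls= and absolute_next_url=) for each city?

Also, is there a way to adjust the script so that each city is written to its own .csv file (i.e. miami.csv, houston.csv, chicago.csv, etc.)?

Thanks in advance.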

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://dallas.craigslist.org/d/transportation/search/trp']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            listing_title = job.xpath('a/text()').extract_first()
            city = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            job_posting_date = job.xpath('time/@datetime').extract_first()
            job_posting_url = job.xpath('a/@href').extract_first()
            data_id = job.xpath('a/@data-id').extract_first()


            yield Request(job_posting_url, callback=self.parse_page, meta={'job_posting_url': job_posting_url, 'listing_title': listing_title, 'city':city, 'job_posting_date':job_posting_date, 'data_id':data_id})

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        # The last results page has no "next" link, so only follow it when present.
        if relative_next_url:
            absolute_next_url = "https://dallas.craigslist.org" + relative_next_url
            yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        job_posting_url = response.meta.get('job_posting_url')
        listing_title = response.meta.get('listing_title')
        city = response.meta.get('city')
        job_posting_date = response.meta.get('job_posting_date')
        data_id = response.meta.get('data_id')

        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract()).strip()

        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()
        latitude = response.xpath('//div/@data-latitude').extract_first()
        longitude = response.xpath('//div/@data-longitude').extract_first()
        posting_id = response.xpath('//p[@class="postinginfo"]/text()').extract()


        #yield{'job_posting_url': job_posting_url, 'listing_title': listing_title, 'city':city, 'job_posting_date':job_posting_date, 'description':description, #'compensation':compensation, 'employment_type':employment_type, 'posting_id':posting_id, 'longitude':longitude, 'latitude':latitude }

        yield {
            'job_posting_url': job_posting_url,
            'data_id': data_id,
            'listing_title': listing_title,
            'city': city,
            'description': description,
            'compensation': compensation,
            'employment_type': employment_type,
            'latitude': latitude,
            'longitude': longitude,
            'job_posting_date': job_posting_date,
            'posting_id': posting_id,
        }

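There may be a cleaner way, but you can basically run multiple instances of your spider together, which means creating a separate "class" for each city. There are probably ways to consolidate some of the code so that it isn't repeated.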
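One way to avoid writing each city class by hand is to generate them from a list. The following is only a rough sketch: the helper names (city_attrs, make_city_spiders, run_all) are made up for illustration, and the city list is taken from the examples in the question. It also assumes the base spider builds its next-page URL with response.urljoin(relative_next_url) instead of the hard-coded Dallas prefix, otherwise pagination would still point at Dallas.

```python
# Hypothetical helper names; the city list mirrors the examples in the question.
CITIES = ["miami", "newyork", "houston", "chicago", "dallas"]


def city_attrs(city):
    """Class attributes for one city's spider (plain Python, no Scrapy needed)."""
    return {
        "name": f"jobs_{city}",
        "start_urls": [f"https://{city}.craigslist.org/d/transportation/search/trp"],
    }


def make_city_spiders(base_spider_cls, cities=CITIES):
    """Create one subclass of the base spider per city."""
    return [
        type(f"{city.title()}JobsSpider", (base_spider_cls,), city_attrs(city))
        for city in cities
    ]


def run_all(base_spider_cls, cities=CITIES):
    """Schedule every city spider in one process, then start the reactor."""
    # Imported here so the helpers above stay usable without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    for spider_cls in make_city_spiders(base_spider_cls, cities):
        process.crawl(spider_cls)
    process.start()  # blocks until all crawls have finished
```

Calling run_all(JobsSpider) from a plain Python script would start all five crawls. Note that crawls scheduled this way share one process and run concurrently rather than strictly one after another; if the order matters, the Scrapy docs describe chaining crawls with CrawlerRunner instead.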

As for writing to CSV, are you currently doing that through the command line? I would add the code to the spider itself.
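As a rough illustration of putting the export into the spider itself, the sketch below attaches a per-city feed through custom_settings. It assumes a Scrapy version that supports the FEEDS setting (2.1 or later; older versions used FEED_URI and FEED_FORMAT instead), and with_csv_feed is a made-up helper name, not part of Scrapy.

```python
def with_csv_feed(spider_cls, city):
    """Return a subclass of spider_cls whose items are exported to <city>.csv."""
    # Copy any existing per-spider settings so the base class is left untouched.
    settings = dict(getattr(spider_cls, "custom_settings", None) or {})
    # FEEDS maps an output URI to its export options (assumes Scrapy >= 2.1).
    settings["FEEDS"] = {f"{city}.csv": {"format": "csv"}}
    return type(spider_cls.__name__, (spider_cls,), {"custom_settings": settings})
```

For example, with_csv_feed(JobsSpider, "miami") would give a spider class that writes miami.csv without needing -o on the command line.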

Glad to hear it!