Python Scrapy spider无法正确迭代&;有一个If语句问题

Python Scrapy spider无法正确迭代&;有一个If语句问题,python,for-loop,scrapy,Python,For Loop,Scrapy,我正试图用Scrapy从一张表中抓取申请人数据。我有两个问题: 1) 我希望每行都有一份CSV: 'username': ['clickclack123'],'lsat':['170'],'gpa':['3.57']... “我的代码”当前提取一行中的所有申请者数据,忽略空值,并对页面上的申请者数量(100行相同的行,其中每行包含页面上的所有数据)重复该提取: 2) 该表包含一类表示申请人特征的元素(“能指”)。我想包括一个If语句来检查能指,并将每个特征保存为True(如果适用)。我在la

我正试图用Scrapy从一张表中抓取申请人数据。我有两个问题:

1) 我希望每行都有一份CSV:

'username': ['clickclack123'],'lsat':['170'],'gpa':['3.57']... 
“我的代码”当前提取一行中的所有申请者数据,忽略空值,并对页面上的申请者数量(100行相同的行,其中每行包含页面上的所有数据)重复该提取:

2) 该表包含一类表示申请人特征的元素(“能指”)。我想包括一个If语句来检查能指,并将每个特征保存为True(如果适用)。我在lawschool.py(如下)中包含了一个带有此逻辑的If语句,但它不允许我的spider运行

我的想法和尝试:

  • 对于问题#1,我看到过类似问题的帖子,但这些解决方案在这种情况下不起作用,因为我的数据包含我不希望忽略的空值
  • 我相信我的For循环有一个问题,因为它没有在每个申请者身上正确地迭代,但我一直无法修复它。它当前将页面上的所有数据提取到my CSV的一行中,但对页面上的申请者数量(100个相同的行,其中每行包含页面上的所有数据)重复提取。如果我将extract()更改为extract_first(),爬行器将只提取第一个申请者的数据(100行相同的行,其中每行包含第一个申请者的数据)
  • 对于问题#2,我不确定为什么我的代码没有使用这个If语句运行,我不得不对它进行注释以解决问题#1
法学院

import scrapy
from ..items import ApplicantItem

class LawschoolSpider(scrapy.Spider):
    name = "lawschool"
    start_urls = [
        'http://nyu.lawschoolnumbers.com/applicants',
        'http://columbia.lawschoolnumbers.com/applicants'
    ]

    def parse(self, response):
        items = []
        for applicant in response.xpath("//tr[@class='row']"):
            signifier = response.xpath("//span[@class='signifier']/text()").extract()
            if signifier == 'W':
                withdrawn = True
            elif signifier == 'A':
                accepted == True
            elif signifier == 'U':
                minority == True
            elif signifier == 'N':
                non_traditional == True
            elif signifier == 'I':
                international = True
            else:
                return False
            school = response.xpath("//h1/text()").extract()
            school = [i.replace(' Applicants','') for i in school]
            item = ApplicantItem(
                school = school,
                username = response.xpath("//td/a/text()").extract(),
                lsat = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[1]/text()").extract(),
                gpa = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[2]/text()").extract(),
                scholarship = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[4]/text()").extract(),
                status = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[5]/text()").extract(),
                sent = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[6]/text()").extract(),
                complete = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[7]/text()").extract(),
                decision = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[8]/text()").extract(),
                last_updated = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[9]/text()").extract()
                withdrawn_application = withdrawn,
                accepted_offer = accepted,
                minority = minority,
                non_traditional = non_traditional,
                international = international
            )
            yield item

        for a in response.xpath("//*[@id='applicants_list']/div/a[9]"):
            yield response.follow(a, callback=self.parse)
items.py

from scrapy import Item, Field


class ApplicantItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    school = Field()
    username = Field()
    lsat = Field()
    gpa = Field()
    scholarship = Field()
    status = Field()
    sent = Field()
    complete = Field()
    decision = Field()
    last_updated = Field()
    withdrawn_application = Field()
    accepted_offer = Field()
    minority = Field()
    non_traditional = Field()
    international = Field()
pipeline.py

from scrapy import signals
from scrapy.exporters import CsvItemExporter

from .items import ApplicantItem

class LSNPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        item_names = ['applicant']
        self.files = self.files = {n: open('%s.csv' % n, 'w+b') for n in item_names}
        self.exporters = {n: CsvItemExporter(f) for n, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def spider_closed(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()

        for file in self.files.values():
            file.close()

    def process_item(self, item, spider):
        if isinstance(item, ApplicantItem):
            self.exporters['applicant'].export_item(item)

        return item
您需要相对XPath表达式:

username = applicant.xpath(".//td/a/text()").extract(),
lsat = applicant.xpath(".//td[2]/text()").extract(),
gpa = applicant.xpath(".//td[3]/text()").extract(),
...

谢谢你看!我试过了,但还是有同样的问题。您知道如何调整代码以正确迭代吗?
username = applicant.xpath(".//td/a/text()").extract(),
lsat = applicant.xpath(".//td[2]/text()").extract(),
gpa = applicant.xpath(".//td[3]/text()").extract(),
...