Python Scrapy: merging results from N pages linked from a single page


python scrapy


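I'm scraping a web page that holds information about a course. The page also links to evaluation pages, one per year, so there is a one-to-N relationship. I have one method that parses the main page and another that parses an evaluation page. The first method calls the second for each link it finds.

My question is: where should I return the Item object?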

def parse_course(self, response):
    hxs = HtmlXPathSelector(response)
    main_div = select_single(hxs, '//div[@class = "CourseViewer"]/div[@id = "pagecontents"]')
    course = CourseItem()
    # here I scrape basic info about the item
    self.process_heading(main_div, course)
    grades_table = select_single(hxs, '//td[@class = "ContentMain"]//table[contains(tr/td/b/text(), "Grades")]')
    grade_links = grades_table.select('tr[2]/td[2]/a/@href').extract()
    for link in grade_links:
        yield Request(link, callback = self.parse_grade_dist_page, meta = {'course' : course})

def parse_grade_dist_page(self, response):
    course = response.meta['course']
    # scrape additional data and store it in CourseItem

There are several ways to do this; here are a few:

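You can keep track of the requests you have made and return the item when the last one completes. This can be tricky, because you also have to handle the case where a request fails.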
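As a rough sketch of that bookkeeping, the helper below counts the requests still outstanding for one item and hands the item back once the count reaches zero. `PendingItem`, `finish_one`, and the `grades` field are hypothetical names chosen for illustration, and a plain dict stands in for `CourseItem`.

```python
class PendingItem:
    """Holds one item plus the number of requests still in flight for it."""

    def __init__(self, item, pending):
        self.item = item
        self.pending = pending

    def finish_one(self, grade=None):
        """Record one finished request, successful or failed.

        Pass the scraped data as `grade` on success and nothing on failure.
        Returns the item once no requests remain, otherwise None.
        """
        if grade is not None:
            self.item.setdefault("grades", []).append(grade)
        self.pending -= 1
        return self.item if self.pending == 0 else None


# Stand-alone demonstration: three grade pages, the second one fails.
tracker = PendingItem({"title": "Algebra"}, pending=3)
results = [
    tracker.finish_one({"year": 2010}),
    tracker.finish_one(),               # failed request, nothing to store
    tracker.finish_one({"year": 2012}),
]
```

In a spider, `parse_course` would create the tracker with `len(grade_links)` and pass it through `meta` on every `Request`, setting both `callback` and `errback`. The callback would call `finish_one(...)` on `response.meta['tracker']`, the errback would call it on `failure.request.meta['tracker']`, and each would yield the result when it is not None. Two cases need care: a course with no grade links must yield its item directly, and a request dropped by the duplicate filter never reaches either handler, so the count would never hit zero.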

You can perform the requests linearly, one after another. Here too, you have to handle a failed request and carry on with the remaining ones.
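One way to picture the linear approach: each callback asks a small helper what to do next, and the helper returns either the next request or, when no links remain, the finished item. `next_step`, `make_request`, and the `remaining` meta key are hypothetical names for this sketch; `fake_request` only imitates a Scrapy `Request` so the snippet runs without a crawler.

```python
def next_step(item, remaining_links, make_request):
    """Return the next request to send, or the finished item if none are left."""
    if remaining_links:
        link, rest = remaining_links[0], remaining_links[1:]
        # Carry the item and the links still to visit along with the request.
        return make_request(link, {"course": item, "remaining": rest})
    return item


def fake_request(url, meta):
    """Stand-in for Request(url, callback=..., errback=..., meta=meta)."""
    return ("REQUEST", url, meta)


# Simulate the crawl: every "response" triggers the next step of the chain.
course = {"title": "Algebra", "grades": []}
step = next_step(course, ["/g/2010", "/g/2011"], fake_request)
visited = []
while isinstance(step, tuple):
    _, url, meta = step
    visited.append(url)
    meta["course"]["grades"].append(url)  # stand-in for the scraped data
    step = next_step(meta["course"], meta["remaining"], fake_request)
```

In a real spider both the callback and the errback would end by yielding `next_step(...)`, since a Scrapy callback may yield requests and items alike. Routing failures through the same helper is what keeps the chain going when one page cannot be fetched.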

You can use the inline_requests decorator (from the scrapy-inline-requests package):

from inline_requests import inline_requests  # pip install scrapy-inline-requests

@inline_requests
def parse_course(self, response):

    # ...

    for link in grade_links:
        try:
            response = yield Request(link)
        except Exception as e:
            # handle the exception here
            pass
        else:
            # extract the data here
            pass

    # at the end yield the item
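To see why `response = yield Request(link)` can sit inside a `try`/`except`, here is a toy driver built on plain Python generators. It is not how scrapy-inline-requests is implemented; `drive`, `fetch`, and the use of bare URL strings in place of `Request` objects are all assumptions made for illustration.

```python
def drive(gen, fetch):
    """Run `gen`, answering each yielded URL string with fetch(url).

    Anything yielded that is not a string is collected as a finished item.
    """
    items = []
    try:
        value = next(gen)
        while True:
            if isinstance(value, str):        # a "request"
                try:
                    result = fetch(value)
                except Exception as exc:
                    value = gen.throw(exc)    # raised at the generator's yield
                else:
                    value = gen.send(result)  # becomes the value of the yield
            else:                             # an item
                items.append(value)
                value = next(gen)
    except StopIteration:
        pass
    return items


def parse_course(links):
    course = {"title": "Algebra", "grades": []}
    for link in links:
        try:
            page = yield link
        except IOError:
            continue                          # skip pages that failed to download
        else:
            course["grades"].append(page)
    yield course                              # at the end, yield the item


def fetch(url):
    if url.endswith("2011"):
        raise IOError("download failed")
    return "grades from " + url


items = drive(parse_course(["/g/2010", "/g/2011", "/g/2012"]), fetch)
```

The generator reads like sequential code, yet the driver decides when each page is fetched and whether the generator resumes with a result or an exception. That is the appeal of the decorator approach: the item is built and yielded in one place.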