Python: multiple pieces of data under one title, with Scrapy


I'm writing a scraper with the following two functions, which sit at the bottom of the crawl process:

from urlparse import urljoin

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


def parse_summary(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    soup = BeautifulSoup(hxs.select("//div[@class='PrimaryContent']").extract()[0])
    text = soup.get_text()
    item['main_summary'] = text

    summary_links = hxs.select("//ul[@class='module_leftnav']/li/a/@href").extract()
    chap_summary_links = [urljoin(response.url, link) for link in summary_links]

    for link in chap_summary_links:
        print 'yielding request to chapter summary.'
        yield Request(link, callback=self.parse_chap_summary_link, meta={'item': item})


def parse_chap_summary_link(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item['chapter_summaries'] = hxs.select("//h1/text()").extract()
    soup = BeautifulSoup(hxs.select("//div[@class='PrimaryContent']").extract()[0])
    text = soup.get_text()
    item['chapter_summaries'] += [text]
    yield item
At the bottom of parse_summary I yield requests to parse_chap_summary_link in order to extract data from the chapter summary pages. This works, but the problem is that the output gives me:

    {item 1, [chapter 1 summary]}
    {item 1, [chapter 2 summary]}
But I want:

    {item 1, [Chapter 1 summary, Chapter 2 Summary]}
    {item 2, [Chapter 1 summary, Chapter 2 Summary, Chapter 3 etc etc]}

How can I get all of the chapter summary information into one item, instead of creating a new item for each chapter summary?

One option is to perform the requests one after another. For example:

def parse_summary(self, response):
    # ...

    links = [urljoin(response.url, link) for link in summary_links]
    return self._dispatch_summary_request(item, links)

def parse_chap_summary_link(self, response):
    item = response.meta['item']

    # ... collect summary into the item field

    return self._dispatch_summary_request(item, response.meta['summary_links'])

def _dispatch_summary_request(self, item, links):
    try:
        next_link = links.pop()
    except IndexError:
        # no links left
        return item
    else:
        # TODO: it might happen that one request fails and to not lose the item
        # the request must have an errback callback to handle the failure and
        # resume the next summary request.
        return Request(next_link, meta={'item': item, 'summary_links': links},
                       callback=self.parse_chap_summary_link)
Another option is to use the inline_requests decorator:

@inline_requests
def parse_summary(self, response):
    # ...
    for link in chap_summary_links:
        try:
            response = yield Request(link)
        except Exception:
            # TODO: handle the error, log or something
            pass
        else:
            # extract the summary as in parse_chap_summary_link ...
            item['chapter_summaries'] += [text]

    # Must use yield at the end as this callback is a generator
    # due the previous yield statements.
    yield item
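Note that inline_requests is not part of Scrapy itself; assuming the decorator comes from the scrapy-inline-requests package, it would be installed and imported along these lines:

    # pip install scrapy-inline-requests
    from inline_requests import inline_requests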

Hi Rho, thanks for your answer. I tried the first approach, but it only captures the text of the first summary link and not the others (if they exist). Can you give me some advice? Ah, just figured it out. I need to initialize item['chapter_summaries'] to an empty list in parse_summary.
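For reference, the resulting fix is one line in parse_summary plus an append instead of an assignment (a sketch against the code above):

    def parse_summary(self, response):
        # ...
        item['chapter_summaries'] = []  # initialize once, before any requests
        return self._dispatch_summary_request(item, links)

    def parse_chap_summary_link(self, response):
        item = response.meta['item']
        # ... extract text as before ...
        item['chapter_summaries'].append(text)  # append, don't overwrite
        return self._dispatch_summary_request(item, response.meta['summary_links'])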