Python: multiple pieces of data under one title, with Scrapy


I'm writing a scraper with the following two functions, which sit at the bottom of the crawl process:

from urlparse import urljoin

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector


def parse_summary(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    soup = BeautifulSoup(hxs.select("//div[@class='PrimaryContent']").extract()[0])
    text = soup.get_text()
    item['main_summary'] = text

    summary_links = hxs.select("//ul[@class='module_leftnav']/li/a/@href").extract()
    chap_summary_links = [urljoin(response.url, link) for link in summary_links]

    for link in chap_summary_links:
        print 'yielding request to chapter summary.'
        yield Request(link, callback=self.parse_chap_summary_link, meta={'item': item})


def parse_chap_summary_link(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item['chapter_summaries'] = hxs.select("//h1/text()").extract()
    soup = BeautifulSoup(hxs.select("//div[@class='PrimaryContent']").extract()[0])
    text = soup.get_text()
    item['chapter_summaries'] += [text]
    yield item
At the bottom of parse_summary I yield requests to parse_chap_summary_link in order to extract data from the chapter summary pages. This works, but the problem is that the output gives me:

    {item 1, [chapter 1 summary]}
    {item 1, [chapter 2 summary]}
But I want:

    {item 1, [Chapter 1 summary, Chapter 2 Summary]}
    {item 2, [Chapter 1 summary, Chapter 2 Summary, Chapter 3 etc etc]}

How can I get all of the chapter summary information into one item, instead of creating a new item for each chapter summary?

One option is to perform the requests one after another. For example:

def parse_summary(self, response):
    # ...

    links = [urljoin(response.url, link) for link in summary_links]
    return self._dispatch_summary_request(item, links)

def parse_chap_summary_link(self, response):
    item = response.meta['item']

    # ... collect summary into the item field

    return self._dispatch_summary_request(item, response.meta['summary_links'])

def _dispatch_summary_request(self, item, links):
    try:
        next_link = links.pop()
    except IndexError:
        # no links left
        return item
    else:
        # TODO: it might happen that one request fails and to not lose the item
        # the request must have an errback callback to handle the failure and
        # resume the next summary request.
        return Request(next_link, meta={'item': item, 'summary_links': links},
                       callback=self.parse_chap_summary_link)
Another option is to use the inline_requests decorator:

@inline_requests
def parse_summary(self, response):
    # ...
    for link in chap_summary_links:
        try:
            response = yield Request(link)
        except Exception:
            # TODO: handle the error, log or something
            pass
        else:
            # extract the summary as in parse_chap_summary_link ...
            item['chapter_summaries'] += [text]

    # Must use yield at the end as this callback is a generator
    # due the previous yield statements.
    yield item
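Note that inline_requests is not part of Scrapy itself; assuming the decorator comes from the scrapy-inline-requests package, it would be installed and imported along these lines:

    # pip install scrapy-inline-requests
    from inline_requests import inline_requests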

Hi Rho, thanks for your answer. I tried the first approach, but it only captures the text of the first summary link and not the others (if they exist). Can you give me some advice? Ah, just figured it out. I need to initialize item['chapter_summaries'] to an empty list in parse_summary.
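For reference, the resulting fix is one line in parse_summary plus an append instead of an assignment (a sketch against the code above):

    def parse_summary(self, response):
        # ...
        item['chapter_summaries'] = []  # initialize once, before any requests
        return self._dispatch_summary_request(item, links)

    def parse_chap_summary_link(self, response):
        item = response.meta['item']
        # ... extract text as before ...
        item['chapter_summaries'].append(text)  # append, don't overwrite
        return self._dispatch_summary_request(item, response.meta['summary_links'])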