
Python: scraping an extra field from a linked page (python, xpath, lxml, scrapy)


I'm trying to scrape some posts from the front page, which has almost everything I need. But I also need a date field that only appears on the linked pages. I tried the following callback:

from scrapy.spider import BaseSpider
from macnn_com.items import MacnnComItem

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join
from scrapy.http.request import Request

class MacnnSpider(BaseSpider):
    name = 'macnn_com'
    allowed_domains = ['macnn.com']
    start_urls = ['http://www.macnn.com']
    posts_list_xpath = '//div[@class="post"]'
    item_fields = { 'title': './/h1/a/text()',
                    'link': './/h1/a/@href',
                    'summary': './/p/text()',
                    'image': './/div[@class="post_img"]/div[@class="post_img_border"]/a/img/@original' }

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # iterate over posts
        for qxs in hxs.select(self.posts_list_xpath):
            loader = XPathItemLoader(MacnnComItem(), selector=qxs)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # skip posts with empty titles
            if loader.get_xpath('.//h1/a/text()') == []:
                continue
            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            request = Request(loader.get_xpath('.//h1/a/@href')[0], callback=self.parse_link,meta={'loader':loader})
            yield request
            #loader.add_value('datums',request)
            yield loader.load_item()

    def parse_link(self, response):
        loader = response.meta["loader"]
        hxs = HtmlXPathSelector(response)
        hero = hxs.select("//div[@class='post_header']/h2/text()").extract()
        loader.add_value('datums',hero)
        return loader
But then I get this error:

ERROR: Spider must return Request, BaseItem or None, got ... in ...


What am I doing wrong?

`parse_link` needs to return an item, not a loader:

def parse_link(self, response):
    loader = response.meta["loader"]
    hxs = HtmlXPathSelector(response)
    hero = hxs.select("//div[@class='post-header']/h2/text()").extract()
    loader.add_value('datums',hero)
    return loader.load_item()

I no longer get that error, but I'm also not getting the 'datums' field in my items (by the way, there's a typo in your snippet: post-header should be post_header). `loader.add_value('datums', hero)` doesn't add anything to the item.

Update: I noticed I had a `print hero` right after the `hero` assignment, which was probably clobbering my `hero`. Removed it and it works like a charm. I also removed the `yield loader.load_item()` in the main parse function, because otherwise I got every item twice (once with and once without the datums field).
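The fix above boils down to one pattern: `parse` should yield only the Request (with the partially-filled loader/item stashed in `meta`), and the linked-page callback completes the item and returns it. A minimal sketch of that hand-off, with no Scrapy dependency and hypothetical stand-in names (`FakeRequest`, plain dicts instead of item loaders), so the control flow can be followed in isolation:

```python
class FakeRequest:
    """Stand-in for scrapy's Request: carries a url, a callback, and meta."""
    def __init__(self, url, callback, meta):
        self.url = url
        self.callback = callback
        self.meta = meta

def parse(posts):
    # First callback: build a partial item from the listing page,
    # then yield ONLY the request -- not the item as well, or you
    # would emit each post twice (with and without 'datums').
    for post in posts:
        item = {"title": post["title"], "link": post["link"]}
        yield FakeRequest(post["link"], parse_link, meta={"item": item})

def parse_link(request, date_on_page):
    # Second callback: pull the partial item back out of meta,
    # add the field scraped from the linked page, and return the
    # finished item (not the loader).
    item = request.meta["item"]
    item["datums"] = date_on_page
    return item

# Drive the two callbacks by hand:
reqs = list(parse([{"title": "Post", "link": "http://example.com/p1"}]))
item = parse_link(reqs[0], "2013-05-01")
# item now holds title, link, and the datums field from the linked page
```

In real Scrapy the engine performs the dispatch between the two callbacks; this sketch only illustrates why yielding the item in `parse` as well produces duplicates.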