Python 2.7: how to access the 'files' Scrapy field


I have downloaded some files using the Files Pipeline, and I want to get the value of the files field. I tried to print item['files'], but it gives me a KeyError. Why does this happen, and how can I do it?

import re
from time import strftime
from urlparse import urljoin                        # Python 2.7 standard library

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

from genericspider.items import GenericspiderItem   # project items module (path assumed)


class testspider2(CrawlSpider):
    name = 'genspider'
    URL = 'flu-card.com'
    URLhttp = 'http://www.flu-card.com'
    allowed_domains = [URL]
    start_urls = [URLhttp]
    rules = (
        Rule(LxmlLinkExtractor(allow=(), restrict_xpaths=('//a',), unique=True),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        List = response.xpath('//a/@href').extract()
        item = GenericspiderItem()
        date = strftime("%Y-%m-%d %H:%M:%S")  # date & time as yyyy-mm-dd hh:mm:ss
        MD5hash = ''    # stored as part of the item; some crawled links are not file links, so these fields stay empty
        fileSize = ''
        newFilePath = ''
        File = open('c:/users/kevin123/desktop//ext.txt', 'a')
        for links in List:
            # make relative links absolute
            if re.search('http://www.flu-card.com', links) is None:
                responseurl = re.sub('\/$', '', response.url)
                url = urljoin(responseurl, links)
            else:
                url = links
            #File.write(url+'\n')
            filename = url.split('/')[-1]
            fileExt = ''.join(re.findall('.{3}$', filename))  # last three characters as a crude extension
            if fileExt != '':
                blackList = ['tml', 'pdf', 'com', 'php', 'aspx', 'xml', 'doc']
                if any(x in fileExt for x in blackList):
                    pass    # url is blacklisted
                else:
                    item['filename'] = filename
                    item['URL'] = url
                    item['date'] = date
                    print item['files']       # this line raises the KeyError -- see the answer below
                    File.write(fileExt + '\n')
                    yield GenericspiderItem(
                        file_urls=[url]
                    )
                    yield item
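The question does not show the item class or the project settings, but for the Files Pipeline to run at all they would typically look something like the sketch below (field names and the storage path are assumptions, not taken from the question):

# items.py -- sketch; the actual GenericspiderItem is not shown in the question
import scrapy

class GenericspiderItem(scrapy.Item):
    filename = scrapy.Field()
    URL = scrapy.Field()
    date = scrapy.Field()
    file_urls = scrapy.Field()   # input field read by the FilesPipeline
    files = scrapy.Field()       # output field written by the FilesPipeline

# settings.py -- the FilesPipeline must be enabled and given a storage directory
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'c:/users/kevin123/desktop/downloads'   # example path, assumed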

You cannot access item['files'] in the spider. That is because the files are downloaded by the FilesPipeline, and items only reach the pipelines after they have left the spider.

First the item is yielded, then it goes through the FilesPipeline, the files are downloaded, and only then is the files field populated with the information you want. To access it, you have to write your own pipeline and schedule it after the FilesPipeline; inside that pipeline you can read the files field.
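A minimal sketch of such a pipeline (class and module names here are illustrative, not from the answer):

# pipelines.py -- runs after the FilesPipeline, so item['files'] is already filled in
class FilesFieldPipeline(object):
    def process_item(self, item, spider):
        # 'files' is a list of dicts with keys such as 'url', 'path' and 'checksum'
        for entry in item.get('files', []):
            spider.logger.info('downloaded %s to %s' % (entry['url'], entry['path']))
        return item

# settings.py -- the higher number schedules it after the FilesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'genericspider.pipelines.FilesFieldPipeline': 300,
}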

Also note that in your spider you are yielding different kinds of items.
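One way to avoid that (a sketch, not part of the original answer) is to yield a single item carrying both the metadata and file_urls, so the FilesPipeline attaches the files info to the same item:

item = GenericspiderItem()
item['filename'] = filename
item['URL'] = url
item['date'] = date
item['file_urls'] = [url]   # the FilesPipeline reads this field...
yield item                  # ...and later fills item['files'] before your own pipeline sees it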