Python Scrapy exports an empty CSV

My problem: Scrapy exports an empty CSV.

My code is structured as follows:

items.py:

import scrapy


class BomnegocioItem(scrapy.Item):
    title = scrapy.Field()
    pass
pipelines.py:

class BomnegocioPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py:

BOT_NAME = 'bomnegocio'

SPIDER_MODULES = ['bomnegocio.spiders']
NEWSPIDER_MODULE = 'bomnegocio.spiders'
LOG_ENABLED = True
bomnegocioSpider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from bomnegocio.items  import BomnegocioItem
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
import csv
import urllib2

class bomnegocioSpider(CrawlSpider):

    name = 'bomnegocio'
    allowed_domains = ["http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"]
    start_urls = [
    "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    rules = (Rule (SgmlLinkExtractor(allow=r'/fogao')
    , callback="parse_bomnegocio", follow= True),
    )

    print "=====> Start data extract ...."

    def parse_bomnegocio(self,response):                                                     
        #hxs = HtmlXPathSelector(response)

        #items = [] 
        item = BomnegocioItem()     

        item['title'] = response.xpath("//*[@id='ad_title']/text()").extract()[0]                        
        #items.append(item)

        return item

    print "=====> Finish data extract."     

    #//*[@id="ad_title"]
Terminal:

$ scrapy crawl bomnegocio -o dataextract.csv -t csv

=====> Start data extract ....
=====> Finish data extract.
2014-12-12 13:38:45-0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: bomnegocio)
2014-12-12 13:38:45-0200 [scrapy] INFO: Optional features available: ssl, http11
2014-12-12 13:38:45-0200 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bomnegocio.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['bomnegocio.spiders'], 'FEED_URI': 'dataextract.csv', 'BOT_NAME': 'bomnegocio'}
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-12 13:38:45-0200 [scrapy] INFO: Enabled item pipelines: 
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider opened
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-12 13:38:45-0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Crawled (200) <GET http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713> (referer: None)
2014-12-12 13:38:45-0200 [bomnegocio] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?t=&u=http%3A%2F%2Fsp.bomnegocio.com%2Fregiao-de-bauru-e-marilia%2Feletrodomesticos%2Ffogao-industrial-itajobi-4-bocas-c-forno-54183713>
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Closing spider (finished)
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 308,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 8503,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 538024),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'offsite/domains': 1,
     'offsite/filtered': 1,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 12, 12, 15, 38, 45, 119067)}
2014-12-12 13:38:45-0200 [bomnegocio] INFO: Spider closed (finished)
And it comes out empty.

I tested a few hypotheses:

Is the XPath in my spider wrong? I went to the terminal and typed:

$ scrapy shell "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
>>> response.xpath("//*[@id='ad_title']/text()").extract()[0]
u'\n\t\t\t\n\t\t\t\tFog\xe3o industrial itajobi 4 bocas c/ forno \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t- '
Answer: no, the problem is not in the XPath expression.
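(As a side note, the extracted value is padded with tabs and newlines. Wrapping the same expression in XPath's normalize-space() collapses that whitespace; the expected result below is inferred from the raw string above:)

>>> response.xpath("normalize-space(//*[@id='ad_title']/text())").extract()[0]
u'Fog\xe3o industrial itajobi 4 bocas c/ forno -'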

My imports? The log doesn't show any import errors.


Thanks for your attention; I look forward to hearing your suggestions.

This spider has a few problems:

1. allowed_domains expects domain names, not full URLs, so you want to use:

allowed_domains = ["bomnegocio.com"]
2. The use of rules is not appropriate here, because rules define how a site should be crawled, that is, which links to follow. In this case you don't need to follow any links: you just want to scrape data directly from the URLs listed in start_urls. So I suggest you drop the rules attribute, make the spider extend scrapy.Spider, and scrape the data in the default parse callback (the import below assumes your project package is bomnegocio, as in your settings.py):

from bomnegocio.items import BomnegocioItem
import scrapy

class bomnegocioSpider(scrapy.Spider):

    name = 'bomnegocio'
    allowed_domains = ["bomnegocio.com"]
    start_urls = [
        "http://sp.bomnegocio.com/regiao-de-bauru-e-marilia/eletrodomesticos/fogao-industrial-itajobi-4-bocas-c-forno-54183713"
    ]

    def parse(self, response):
        print "=====> Start data extract ...."
        yield BomnegocioItem(
            title=response.xpath("//*[@id='ad_title']/text()").extract()[0]
        )
        print "=====> Finish data extract."
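With these two fixes the feed exporter actually receives an item, so the same command (scrapy crawl bomnegocio -o dataextract.csv -t csv) should now write a non-empty dataextract.csv. One caveat: .extract()[0] raises an IndexError whenever the XPath matches nothing, so a slightly more defensive sketch of the same callback (same selector assumed) would be:

    def parse(self, response):
        titles = response.xpath("//*[@id='ad_title']/text()").extract()
        if titles:  # extract() returns an empty list when nothing matches
            yield BomnegocioItem(title=titles[0].strip())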
Also note how the print statements now live inside the callback (in your original spider they sat at class level, which is why they were printed before Scrapy even started in your log), and how yield is used instead of return, which lets a single callback produce multiple items from one page.
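For example, yielding inside a loop emits one item per matched node. A hypothetical sketch (the //div[@class='result'] selector is invented for illustration and does not come from this site):

    def parse(self, response):
        # hypothetical listing page: one item per result row
        for row in response.xpath("//div[@class='result']"):
            yield BomnegocioItem(
                title=row.xpath(".//h2/text()").extract()[0]
            )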

@PedroCastro thanks! / You're welcome! Please mark my answer as accepted (the small checkmark below the voting buttons). By the way, you can ask Scrapy questions in Portuguese on the site if you prefer; I keep an eye on the Scrapy tag there.