Python: Empty CSV and JSON output files from Scrapy
So I have been working on this project for school, and I have run into a problem: there is no output in the result files. I have tried to find a solution, but none of the ones I found seem to work for me. Here is my Scrapy spider code:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.selector import Selector
from scrapy_craigslist.items import ScrapyCraigslistItem

class MySpider(CrawlSpider):
    name = 'craigslist'
    allowed_domains = ['vancouver.craigslist.ca']
    start_urls = ['https://vancouver.craigslist.ca/search/apa?']

    rules = (
        Rule(LxmlLinkExtractor(
                allow=(r'vancouver.craigslist.ca/search/apa.*'),
                deny=(r'.*format\=rss.*')
            ),
            callback="parse_items_1",
            follow=False,
        ),
    )

    def parse_items_1(self, response):
        self.logger.info('You are now crawling: %s', response.url)
        items = []
        hxs = Selector(response)
        contents = hxs.xpath("//div[@class='rows']/*")
        for content in contents:
            item = ScrapyCraigslistItem()
            item["title"] = content.xpath("//p/span/span/a/span/text()").extract()[0]
            k = content.xpath("//p/a/@href").extract()[0]
            item['ad_url'] = 'https://vancouver.craigslist.ca{}'.format(''.join(k))
            item["post_date"] = content.xpath("//p/span/span/time/text()").extract()[0]
            item["post_date_specific"] = content.xpath("//p/span/span/time/@datetime").extract()[0]
            item["price"] = content.xpath("//p/span/span[@class='l2']/span/text()").extract()[0]
            item["location"] = content.xpath("//p/span/span[@class='l2']/span[@class='pnr']/small/text()").extract()[0].strip()
        return item
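Two things in this callback look suspect to me: the XPaths all start with `//`, which searches the whole page instead of the current `content` row (relative paths would start with `.//`), and `return item` sits outside the loop, so at most the last item ever leaves the function (in Scrapy, the idiomatic pattern is `yield item` inside the loop). The return-after-loop vs. yield-per-iteration difference can be sketched in plain Python, with made-up row data standing in for the scraped content:

```python
rows = ["row1", "row2", "row3"]

def parse_return(rows):
    # Mirrors the callback above: the item built in the loop is
    # overwritten on every iteration, and only the last one is returned.
    for row in rows:
        item = {"title": row}
    return item

def parse_yield(rows):
    # Idiomatic Scrapy style: yield one item per row, so the feed
    # exporter receives all of them.
    for row in rows:
        yield {"title": row}

print(parse_return(rows))                        # only the last row survives
print([i["title"] for i in parse_yield(rows)])   # every row comes through
```

This is only a sketch of the control-flow difference, not a drop-in fix; the XPath expressions would still need to be made relative (`.//p/...`) for the per-row extraction to work.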
Here is my items.py script:
import scrapy

class ScrapyCraigslistItem(scrapy.Item):
    title = scrapy.Field()
    post_date = scrapy.Field()
    post_date_specific = scrapy.Field()
    price = scrapy.Field()
    location = scrapy.Field()
    ad_url = scrapy.Field()
Here is my pipelines.py script:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.titles_seen = set()

    def process_item(self, item, spider):
        if item['title'] in self.titles_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.titles_seen.add(item['title'])
            return item
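For what it's worth, the dedupe logic in this pipeline can be exercised on its own, without a running crawl. A minimal sketch, using a stand-in `DropItem` so it runs without Scrapy installed (the real `scrapy.exceptions.DropItem` is just an `Exception` subclass):

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch is self-contained."""

class DuplicatesPipeline(object):
    def __init__(self):
        self.titles_seen = set()

    def process_item(self, item, spider):
        # Drop any item whose title has already been seen; pass the rest through.
        if item['title'] in self.titles_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.titles_seen.add(item['title'])
        return item

pipeline = DuplicatesPipeline()
pipeline.process_item({'title': '2br apartment'}, spider=None)   # passes through
try:
    pipeline.process_item({'title': '2br apartment'}, spider=None)
except DropItem as exc:
    print('dropped duplicate:', exc)
```

Since the pipeline behaves as intended in isolation, the empty feed is more likely caused by the spider callback than by this pipeline.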
Here is my settings.py:
BOT_NAME = 'scrapy_craigslist'
SPIDER_MODULES = ['scrapy_craigslist.spiders']
NEWSPIDER_MODULE = 'scrapy_craigslist.spiders'
ITEM_PIPELINES = {
    'scrapy_craigslist.pipelines.DuplicatesPipeline': 10,
}
USER_AGENT = "Mozilla/5.0 (Windows NT 5.1; rv:12.2.1) Gecko/20120616 Firefox/12.2.1 PaleMoon/12.2.1"
Finally, here is what I get on the command line for both the JSON and CSV runs:
/home/logan/Desktop/scrapy-craigslist-master/scrapy_craigslist/spiders/craigslist_scrapy.py:1: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors import LinkExtractor
/home/logan/Desktop/scrapy-craigslist-master/scrapy_craigslist/spiders/craigslist_scrapy.py:2: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.lxmlhtml` is deprecated, use `scrapy.linkextractors.lxmlhtml` instead
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
/home/logan/Desktop/scrapy-craigslist-master/scrapy_craigslist/spiders/craigslist_scrapy.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
from scrapy.contrib.spiders import Rule, CrawlSpider
2017-04-15 14:12:44 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapy_craigslist)
2017-04-15 14:12:44 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_craigslist.spiders', 'FEED_URI': 'load.csv', 'SPIDER_MODULES': ['scrapy_craigslist.spiders'], 'BOT_NAME': 'scrapy_craigslist', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 5.1; rv:12.2.1) Gecko/20120616 Firefox/12.2.1 PaleMoon/12.2.1', 'FEED_FORMAT': 'csv'}
2017-04-15 14:12:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-04-15 14:12:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-15 14:12:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-15 14:12:44 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy_craigslist.pipelines.DuplicatesPipeline']
2017-04-15 14:12:44 [scrapy.core.engine] INFO: Spider opened
2017-04-15 14:12:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-15 14:12:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-15 14:12:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa> (referer: None)
2017-04-15 14:12:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa> (referer: https://vancouver.craigslist.ca/search/apa)
2017-04-15 14:12:46 [craigslist] INFO: You are now crawling: https://vancouver.craigslist.ca/search/apa
2017-04-15 14:12:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://vancouver.craigslist.ca/search/apa?sale_date=2017-04-15&sort=upcoming> from <GET https://vancouver.craigslist.ca/search/apa?sort=upcoming>
2017-04-15 14:12:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa?sort=priceasc> (referer: https://vancouver.craigslist.ca/search/apa)
2017-04-15 14:12:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa?s=120> (referer: https://vancouver.craigslist.ca/search/apa)
2017-04-15 14:12:46 [craigslist] INFO: You are now crawling: https://vancouver.craigslist.ca/search/apa?sort=priceasc
2017-04-15 14:12:46 [craigslist] INFO: You are now crawling: https://vancouver.craigslist.ca/search/apa?s=120
2017-04-15 14:12:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa?sale_date=2017-04-15&sort=upcoming> (referer: https://vancouver.craigslist.ca/search/apa)
2017-04-15 14:12:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa?sort=date> (referer: https://vancouver.craigslist.ca/search/apa)
2017-04-15 14:12:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://vancouver.craigslist.ca/search/apa?sort=pricedsc> (referer: https://vancouver.craigslist.ca/search/apa)
2017-04-15 14:12:46 [craigslist] INFO: You are now crawling: https://vancouver.craigslist.ca/search/apa?sale_date=2017-04-15&sort=upcoming
2017-04-15 14:12:46 [craigslist] INFO: You are now crawling: https://vancouver.craigslist.ca/search/apa?sort=date
2017-04-15 14:12:46 [craigslist] INFO: You are now crawling: https://vancouver.craigslist.ca/search/apa?sort=pricedsc
2017-04-15 14:12:46 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-15 14:12:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2728,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 241678,
'downloader/response_count': 8,
'downloader/response_status_count/200': 7,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 15, 21, 12, 46, 796319),
'log_count/DEBUG': 9,
'log_count/INFO': 13,
'request_depth_max': 1,
'response_received_count': 7,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'start_time': datetime.datetime(2017, 4, 15, 21, 12, 44, 841317)}
2017-04-15 14:12:46 [scrapy.core.engine] INFO: Spider closed (finished)
Sorry for the long post, and thank you.