Web crawler / Scrapy: spider does not generate item signals
Python 2.7.6.2 on Windows 7, using the WinPython-32bit-2.7.6.2 binary distribution, Scrapy 0.22.0, Eclipse 4.2.1, and Twisted-13.2.0.win32-py2.7.

I'm learning Scrapy. I have it doing everything except properly calling pipelines.process_item(). It is calling pipelines.open_spider() and pipelines.close_spider() fine.

I think this is because the spider is not generating any "item" signals (neither item_passed, item_dropped, nor item_scraped).

I added some code to try to catch those signals, but I get nothing when I try to catch any of the three item signals above. The same code does catch other signals (engine_started, spider_closed, etc.).

If I try to set an item['doesnotexist'] variable it errors out, so the spider does seem to be using the items file and my user-defined items class AuctionDOTcomItems.

I'm at a loss and would greatly appreciate any help with either:

A) getting pipelines.process_item() to work properly, or
B) being able to manually catch the signal that an item has been set, so I can pass control to my own version of pipelines.process_item().

Reactor:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

class SpiderRun:
    def __init__(self, spider):
        settings = get_project_settings()
        mySettings = {'ITEM_PIPELINES': {'estatescraper.pipelines.EstatescraperXLSwriter': 300}}
        settings.overrides.update(mySettings)

        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
        # log.start()
        reactor.run()  # the script will block here until the spider_closed signal was sent
        self.cleanup()

    def cleanup(self):
        print "SpiderRun done"  #333
        pass

if __name__ == "__main__":
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)
Spider:
from scrapy.xlib.pydispatch import dispatcher
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy import signals
from scrapy.spider import Spider

from auctiondotcomurls import AuctionDOTcomURLs
from auctiondotcomitems import AuctionDOTcomItems
from auctiondotcomgetitems import AuctionDOTcomGetItems

import urlparse
import time
import sys

class AuctionDOTcom(Spider):
    def __init__(self,
                 limit=50,
                 miles=250,
                 zip=None,
                 asset_types="",
                 auction_types="",
                 property_types=""):
        self.name = "auction.com"
        self.allowed_domains = ["auction.com"]
        self.start_urls = AuctionDOTcomURLs(limit, miles, zip, asset_types,
                                            auction_types, property_types)
        dispatcher.connect(self.testsignal, signals.item_scraped)

#    def _item_passed(self, item):
#        print "item = ", item  #333

    def testsignal(self):
        print "in csvwrite"  #333

    def parse(self, response):
        sel = Selector(response)
        listings = sel.xpath('//div[@class="contentDetail searchResult"]')
        for listing in listings:
            item = AuctionDOTcomItems()
            item['propertyID'] = ''.join(set(listing.xpath('./@property-id').extract()))
            print "item['propertyID'] = ", item['propertyID']  #333

#            item = AuctionDOTcomGetItems(listing)
#            ################
#            # DEMONSTRATION ONLY
#            print "######################################"
#            for i in item:
#                print i + ": " + str(item[i])

        next = set(sel.xpath('//a[contains(text(),"Next")]//@href').extract())
        for i in next:
            yield Request("http://%s/%s" % (urlparse.urlparse(response.url).hostname, i),
                          callback=self.parse)

if __name__ == "__main__":
    from estatescraper import SpiderRun
    from estatescraper import AuctionDOTcom
    spider = AuctionDOTcom()
    r = SpiderRun(spider)
Pipeline:
import csv
from csv import DictWriter

# class TutorialPipeline(object):
#     def process_item(self, item, spider):
#         return item

class EstatescraperXLSwriter(object):
    def __init__(self):
        print "Ive started the __init__ in the pipeline"  #333
        self.brandCategoryCsv = csv.writer(open('test.csv', 'wb'),
                                           delimiter=',',
                                           quoting=csv.QUOTE_MINIMAL)
        self.brandCategoryCsv.writerow(['Property ID', 'Asset Type'])

    def open_spider(self, spider):
        print "Hit open_spider in EstatescraperXLSwriter"  #333

    def process_item(self, item, spider):
        print "attempting to run process_item"  #333
        self.brandCategoryCsv.writerow([item['propertyID'],
                                        item['assetType']])
        return item

    def close_spider(self, spider):
        print "Hit close_spider in EstatescraperXLSwriter"  #333
        pass

if __name__ == "__main__":
    o = EstatescraperXLSwriter()
Items:
from scrapy.item import Item, Field

class AuctionDOTcomItems(Item):
    """"""
    propertyID = Field()  # <uniqueID>ABCD1234</uniqueID>
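A side note on this Item class: the pipeline's process_item() reads item['assetType'], but only propertyID is declared here, and a Scrapy Item raises KeyError for any field not declared with Field(). A minimal stand-in sketch (plain Python, no Scrapy; StrictItem is a hypothetical class that only mimics the declared-fields check, and the real AuctionDOTcomItems definition in auctiondotcomitems.py may declare more fields) shows the failure mode that would surface once items start flowing:

```python
class StrictItem(dict):
    """Hypothetical stand-in mimicking a Scrapy Item's behaviour of
    rejecting assignments to fields that were not declared."""
    fields = ('propertyID',)  # what the AuctionDOTcomItems shown above declares

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        dict.__setitem__(self, key, value)

item = StrictItem()
item['propertyID'] = 'ABCD1234'        # fine: declared
undeclared_rejected = False
try:
    item['assetType'] = 'residential'  # fails: not declared
except KeyError:
    undeclared_rejected = True
```

So before process_item() can write the assetType column, the field would need to be declared (and populated by the spider) as well.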
Logged output:
Ive started the __init__ in the pipeline
Hit open_spider in EstatescraperXLSwriter
2014-02-27 17:44:12+0100 [auction.com] INFO: Closing spider (finished)
2014-02-27 17:44:12+0100 [auction.com] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 240,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 40640,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 2, 27, 16, 44, 12, 238000),
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 2, 27, 16, 44, 9, 203000)}
2014-02-27 17:44:12+0100 [auction.com] INFO: Spider closed (finished)
I don't see you yielding items in def parse, only Request objects. Try "yield item" at some point inside the for listing in listings: loop. – paul t., Feb 27 at 17:42

That simple!! Thank you so much, Paul! I wasted hours on that simple mistake!
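For reference, a minimal plain-Python sketch of what the fix changes (no Scrapy here: parse stands in for the spider's method, and plain dicts stand in for AuctionDOTcomItems and Request). Without the yield item line the generator emits only requests, so no item_scraped signal ever fires and process_item() is never called; with it, items flow to the pipeline alongside the follow-up requests:

```python
def parse(listings, next_pages):
    """Stand-in for AuctionDOTcom.parse(): yields each item it builds,
    then yields the follow-up page requests."""
    for listing in listings:
        item = {'propertyID': listing}  # stand-in for AuctionDOTcomItems()
        yield item                      # <-- the line the original parse() was missing
    for url in next_pages:
        yield {'request': url}          # stand-in for Request(url, callback=self.parse)

results = list(parse(['ABCD1234', 'EFGH5678'], ['http://example.com/page2']))
items = [r for r in results if 'propertyID' in r]
# two items now reach the "pipeline", plus one follow-up request
```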