Python: Why is my Scrapy spider not running as expected?
Tags: python, python-2.7, web-scraping, scrapy, scraper
When I run the code below, I get a file that contains all of the expected data from the second code block but none of the data from the first. In other words, everything from eventLocation to eventURL is present, but everything from eventArtist to eventDetails is missing. What do I need to change to make this work?
import urlparse

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
#from NT.items import NowTorontoItem
from scrapy.item import Item, Field


class NowTorontoItem(Item):
    eventArtist = Field()
    eventTitle = Field()
    eventHolder = Field()
    eventDetails = Field()
    eventLocation = Field()
    eventOrganization = Field()
    eventName = Field()
    eventAddress = Field()
    eventLocality = Field()
    eventPostalCode = Field()
    eventPhone = Field()
    eventURL = Field()


class MySpider(BaseSpider):
    name = "NTSpider"
    allowed_domains = ["nowtoronto.com"]
    start_urls = ["http://www.nowtoronto.com/music/listings/"]

    def parse(self, response):
        selector = Selector(response)
        listings = selector.css("div.listing-item0, div.listing-item1")

        for listing in listings:
            item = NowTorontoItem()
            for body in listing.css('span.listing-body > div.List-Body'):
                item["eventArtist"] = body.css("span.List-Name::text").extract()
                item["eventTitle"] = body.css("span.List-Body-Emphasis::text").extract()
                item["eventHolder"] = body.css("span.List-Body-Strong::text").extract()
                item["eventDetails"] = body.css("::text").extract()

            # yield a Request()
            # so that scrapy enqueues a new page to fetch
            detail_url = listing.css("div.listing-readmore > a::attr(href)")

            if detail_url:
                yield Request(urlparse.urljoin(response.url,
                                               detail_url.extract()[0]),
                              callback=self.parse_details)

    def parse_details(self, response):
        self.log("parse_details: %r" % response.url)
        selector = Selector(response)
        listings = selector.css("div.whenwhereContent")

        for listing in listings:
            for body in listing.css('td.small-txt.dkgrey-txt.rightInfoTD'):
                item = NowTorontoItem()
                item["eventLocation"] = body.css("span[property='v:location']::text").extract()
                item["eventOrganization"] = body.css("span[property='v:organization'] span[property='v:name']::text").extract()
                item["eventName"] = body.css("span[property='v:name']::text").extract()
                item["eventAddress"] = body.css("span[property='v:street-address']::text").extract()
                item["eventLocality"] = body.css("span[property='v:locality']::text").extract()
                item["eventPostalCode"] = body.css("span[property='v:postal-code']::text").extract()
                item["eventPhone"] = body.css("span[property='v:tel']::text").extract()
                item["eventURL"] = body.css("span[property='v:url']::text").extract()
                yield item
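Edit
It now seems to be running, but there is one small problem. For each event it returns either two rows (one with all of the details and one with only the details extracted in the first code block) or three rows (one with all of the details and two identical rows with only the details extracted in the first code block).
Here is an example of the first case: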
2014-03-21 11:12:40-0400 [NTSpider] DEBUG: parse_details: 'http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761&subsection=&category=&criticspicks=&date1=&date2=&locationId=0'
2014-03-21 11:12:40-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
{'eventAddress': [u'875 Bloor W'],
'eventArtist': [u'Andria Simone & Those Guys'],
'eventDetails': [u'Andria Simone & Those Guys',
u' (pop/soul) ',
u'Baltic Avenue',
u' 8 pm, $15.'],
'eventHolder': [u'Baltic Avenue'],
'eventLocality': [u'Toronto'],
'eventLocation': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t'],
'eventName': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBaltic Avenue'],
'eventOrganization': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBaltic Avenue'],
'eventPhone': [u'647-898-5324'],
'eventPostalCode': [u'M6G 3T6'],
'eventTitle': [],
'eventURL': []}
2014-03-21 11:12:40-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=129761&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
{'eventAddress': [],
'eventArtist': [u'Andria Simone & Those Guys'],
'eventDetails': [u'Andria Simone & Those Guys',
u' (pop/soul) ',
u'Baltic Avenue',
u' 8 pm, $15.'],
'eventHolder': [u'Baltic Avenue'],
'eventLocality': [],
'eventLocation': [],
'eventName': [],
'eventOrganization': [],
'eventPhone': [],
'eventPostalCode': [],
'eventTitle': [],
'eventURL': []}
Here is an example of the second case:
2014-03-21 11:21:23-0400 [NTSpider] DEBUG: parse_details: 'http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0'
2014-03-21 11:21:23-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
{'eventAddress': [u'11 Polson'],
'eventArtist': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy '],
'eventDetails': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy ',
u'Bassweek: Projek-Hospitality ',
u'Sound Academy',
u' $35 or wristband TM.'],
'eventHolder': [u'Sound Academy'],
'eventLocality': [u'Toronto'],
'eventLocation': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t'],
'eventName': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSound Academy'],
'eventOrganization': [u'\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSound Academy'],
'eventPhone': [u'416-461-3625'],
'eventPostalCode': [u'M5A 1A4'],
'eventTitle': [u'Bassweek: Projek-Hospitality '],
'eventURL': [u'sound-academy.com']}
2014-03-21 11:21:23-0400 [NTSpider] DEBUG: Crawled (200) <GET http://www.nowtoronto.com/music/listings/listing.cfm?listingid=122291&subsection=&category=&criticspicks=&date1=&date2=&locationId=0> (referer: http://www.nowtoronto.com/music/listings/)
2014-03-21 11:21:24-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
{'eventAddress': [],
'eventArtist': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy '],
'eventDetails': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy ',
u'Bassweek: Projek-Hospitality ',
u'Sound Academy',
u' $35 or wristband TM.'],
'eventHolder': [u'Sound Academy'],
'eventLocality': [],
'eventLocation': [],
'eventName': [],
'eventOrganization': [],
'eventPhone': [],
'eventPostalCode': [],
'eventTitle': [u'Bassweek: Projek-Hospitality '],
'eventURL': []}
2014-03-21 11:21:24-0400 [NTSpider] DEBUG: Scraped from <200 http://www.nowtoronto.com/music/listings/listing.cfm?listingid=130096&subsection=&category=&criticspicks=&date1=&date2=&locationId=0>
{'eventAddress': [],
'eventArtist': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy '],
'eventDetails': [u'Danny Byrd, S.P.Y., Fred V & Grafix, Marcus Visionary, Lushy ',
u'Bassweek: Projek-Hospitality ',
u'Sound Academy',
u' $35 or wristband TM.'],
'eventHolder': [u'Sound Academy'],
'eventLocality': [],
'eventLocation': [],
'eventName': [],
'eventOrganization': [],
'eventPhone': [],
'eventPostalCode': [],
'eventTitle': [u'Bassweek: Projek-Hospitality '],
'eventURL': []}
You should pass the item from parse() to parse_details() in the meta argument of the Request:
yield Request(urlparse.urljoin(response.url,
                               detail_url.extract()[0]),
              meta={'item': item},
              callback=self.parse_details)
Then, in parse_details(), you can get the item back from response.meta['item'] and keep filling it in, instead of creating a new NowTorontoItem() there.
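To make the hand-off concrete, here is a minimal sketch of the pattern that runs without Scrapy. FakeRequest, FakeResponse and crawl are hypothetical stand-ins written for this illustration only (they are not Scrapy classes), and the sample page data is made up; the one thing it mirrors from Scrapy is that the meta dict given to a request is available as response.meta in the callback.

```python
class FakeRequest(object):
    """Hypothetical stand-in for scrapy.http.Request (illustration only)."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse(object):
    """Hypothetical stand-in for a response; carries the request's meta."""

    def __init__(self, url, meta, data):
        self.url = url
        self.meta = meta
        self.data = data  # stand-in for the parsed page content


def parse(response):
    # Start one item per listing and hand it to the detail callback via meta.
    for listing in response.data:
        item = {"eventArtist": listing["artist"]}
        yield FakeRequest(listing["detail_url"], parse_details,
                          meta={"item": item})


def parse_details(response):
    # Pick up the item started in parse() instead of creating a new one.
    item = response.meta["item"]
    item["eventAddress"] = response.data["address"]
    yield item


def crawl(start_response, pages):
    # Tiny driver that mimics the engine: follow requests, collect items.
    items = []
    pending = list(parse(start_response))
    while pending:
        result = pending.pop(0)
        if isinstance(result, FakeRequest):
            response = FakeResponse(result.url, result.meta, pages[result.url])
            pending.extend(result.callback(response))
        else:
            items.append(result)
    return items


if __name__ == "__main__":
    pages = {"/listing/1": {"address": ["875 Bloor W"]}}
    start = FakeResponse("/listings", {}, [
        {"artist": ["Andria Simone & Those Guys"], "detail_url": "/listing/1"},
    ])
    print(crawl(start, pages))
```

In the real spider the same two lines matter: meta={'item': item} on the Request in parse(), and item = response.meta['item'] at the top of parse_details() in place of item = NowTorontoItem().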
Also, you may want to yield the item directly if no details link is found:
if detail_url:
    yield Request(urlparse.urljoin(response.url,
                                   detail_url.extract()[0]),
                  meta={'item': item},
                  callback=self.parse_details)
else:
    yield item
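The two or three rows per event described in the edit would be consistent with parse_details() yielding once for every matching td cell, with cells that contain no address data writing empty lists into the shared item. The edited spider is not shown, so this is an assumption rather than a confirmed diagnosis. Under that assumption, a sketch of one way to get a single row per event: merge only non-empty values and yield after the loop. merge_detail_cells and parse_details_once are hypothetical helper names, and the cells are plain dicts standing in for the per-cell extract() results.

```python
def merge_detail_cells(item, cells):
    """Fill item from several extracted cells, keeping the first
    non-empty value found for each field."""
    for cell in cells:
        for field, values in cell.items():
            # Skip empty extractions so a later cell cannot blank out
            # data that an earlier cell already supplied.
            if values and not item.get(field):
                item[field] = values
    return item


def parse_details_once(item, cells):
    # Yield after all cells are merged, so each detail page
    # produces exactly one item.
    yield merge_detail_cells(item, cells)


if __name__ == "__main__":
    item = {"eventArtist": ["Andria Simone & Those Guys"]}
    cells = [
        {"eventAddress": ["875 Bloor W"], "eventURL": []},
        {"eventAddress": [], "eventURL": []},
    ]
    for row in parse_details_once(item, cells):
        print(row)
```

Fields that are empty in every cell (eventURL here) are simply left unset in this sketch; assign an empty list up front if the output should always contain every column.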
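Hope that helps.

When you say "Then, in parse_details(), you can get the item from response.meta['item']", are you suggesting that I need to change code outside the block that begins with if detail_url: and ends with yield item? I only changed the if detail_url: part to match your code, but it still only extracts data from the second code block.

@zgall1 Yes, the idea is to pass the item you initialized in parse() to parse_details(), so that you can continue filling in the details and yield it once they are filled in.

@Alexe So how do I do that? Do I pass it as an argument to def parse_details(self, response):?

@alexcde I feel a bit out of my depth here. Could you be more specific? I don't understand. Thanks, by the way. I agree.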