Scrapy xpath returning an empty list in Python
I don't know what I'm doing wrong. I am trying to extract text and store it in a list. In Firebug and FirePath, when I enter the XPath it shows exactly the right text, but when I apply it in the spider it returns an empty list. I am trying to scrape www.insider.in/mumbai: the spider should follow all the links and grab the event title, address, and other details. Here is my newly edited code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapy.selector import HtmlXPathSelector
import time
import requests
import csv

class insiderSpider(BaseSpider):
    name = 'insider'
    allowed_domains = ["insider.in"]
    start_urls = ["http://www.insider.in/mumbai/"]

    def parse(self, response):
        driver = webdriver.Firefox()
        print response.url
        driver.get(response.url)
        s = Selector(response)
        #hxs = HtmlXPathSelector(response)
        source_link = []
        temp = []
        title = ""
        Price = ""
        Venue_name = ""
        Venue_address = ""
        description = ""
        event_details = []
        alllinks = s.xpath('//div[@class="bottom-details-right"]//a/@href').extract()
        print alllinks
        length_of_alllinks = len(alllinks)
        for single_event in range(1, length_of_alllinks):
            if "https://insider.in/event" in alllinks[single_event]:
                source_link.append(alllinks[single_event])
                driver.get(alllinks[single_event])
                s = Selector(response)
                #hxs = HtmlXPathSelector(response)
                time.sleep(3)
                title = s.xpath('//div[@class = "cell-title in-headerTitle"]/h1//text()').extract()
                print title
                temp = s.xpath('//div[@class = "cell-caption centered in-header"]//h3//text()').extract()
                print temp
                time.sleep(2)
                a = len(s.xpath('//div[@class = "bold-caption price"]//text()').extract())
                if a > 0:
                    Price = s.xpath('//div[@class = "bold-caption price"]//text()').extract()
                    time.sleep(2)
                else:
                    Price = "RSVP"
                    time.sleep(2)
                print Price
                Venue_name = s.xpath('//div[@class = "address"]//div[@class = "section-title"]//text()').extract()
                print Venue_name
                Venue_address = s.xpath('//div[@class ="address"]//div//text()[preceding-sibling::br]').extract()
                print Venue_address
                description = s.xpath('//div[@class="cell-caption accordion-padding"]//text()').extract()
                print description
                time.sleep(5)
                event_details.append([title, temp, Price, Venue_name, Venue_address, description])
            else:
                print "Other part"
Edited output:
[u'https://insider.in/weekender-music-festival-2015', u'https://insider.in/event/east-india-comedy-presents-back-benchers#', u'https://insider.in/event/art-of-story-telling', u'https://insider.in/feelings-in-india-with-kanan-gill', u'https://insider.in/event/the-tall-tales-workshop-capture-your-story', u'https://insider.in/halloween-by-the-pier-2015', u'https://insider.in/event/whats-your-story', u'https://insider.in/event/beyond-contemporary-art']
2015-08-03 12:53:29 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:60924/hub/session/f675b909-5515-41d4-a89e-d197c296023d/url {"url": "https://insider.in/event/east-india-comedy-presents-back-benchers#", "sessionId": "f675b909-5515-41d4-a89e-d197c296023d"}
2015-08-03 12:53:29 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
[]
[]
RSVP
[]
[]
[]
[[[], [], 'RSVP', [], [], []]]
Even when the if condition fails, "RSVP" gets printed. I can't see what I'm doing wrong, and I've been stuck on this part for three days. Please help.

I removed the webdriver parts and got a basic version of the code working:
import scrapy
import logging
from scrapy.http import Request
from scrapy.selector import Selector

class insiderSpider(scrapy.Spider):
    name = 'insider'
    allowed_domains = ["insider.in"]
    start_urls = ["http://www.insider.in/mumbai/"]
    event_details = list()  # Changed: event_details is now member data of the class

    def parse(self, response):
        source_link = []
        temp = []
        title = ""
        Price = ""
        Venue_name = ""
        Venue_address = ""
        description = ""
        alllinks = response.xpath('//div[@class="bottom-details-right"]//a/@href').extract()
        print alllinks
        for single_event in alllinks:
            if "https://insider.in/event" in single_event:
                yield Request(url = single_event, callback = self.parse_event)
            else:
                print 'Other part'

    def parse_event(self, response):
        title = response.xpath('//div[@class = "cell-title in-headerTitle"]/h1//text()').extract()
        print title
        temp = response.xpath('//div[@class = "cell-caption centered in-header"]//h3//text()').extract()
        print temp
        a = len(response.xpath('//div[@class = "bold-caption price"]//text()').extract())
        if a > 0:
            Price = response.xpath('//div[@class = "bold-caption price"]//text()').extract()
        else:
            Price = "RSVP"
        print Price
        Venue_name = response.xpath('normalize-space(//div[@class = "address"]//div[@class = "section-title"]//text())').extract()
        print Venue_name
        Venue_address = response.xpath('normalize-space(//div[@class ="address"]//div//text()[preceding-sibling::br])').extract()
        print Venue_address
        description = response.xpath('normalize-space(//div[@class="cell-caption accordion-padding"]//text())').extract()
        print description
        self.event_details.append([title, temp, Price, Venue_name, Venue_address, description])  # Note: event_details is used as self.event_details, i.e. member data
        print self.event_details  # Here also self.event_details
Comments:

Is using webdriver absolutely necessary? If it is, then fine, but read up on it, it isn't mandatory here.. Can you suggest some solution?

Yes, it's working, but when the event details are printed at the end, it only shows the details of the last event page executed. It's a list, so it should show all events, or correct me if I'm wrong.

Sorry, that was a terrible mistake; I should have implemented event_details as member data of the class. I'll fix it right away.

Thanks for pointing out why the list wasn't being filled with everything the first time: it was because the list event_details was defined in the scope of the function parse_event. I also had to write a second parse function to handle the "Other part".

@sarahjones, this is beside the point, but why use `length_of_alllinks = len(alllinks)` and `for single_event in range(1, length_of_alllinks):`? Doesn't that skip the first element of the extracted links?
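The last comment flags a real off-by-one in the original spider: `range(1, len(alllinks))` starts at index 1, so the first extracted link is never processed. A minimal illustration with placeholder link names:

```python
links = ['link0', 'link1', 'link2', 'link3']

# range(1, len(links)) starts at index 1, silently dropping links[0]:
skipped = [links[i] for i in range(1, len(links))]
print(skipped)  # ['link1', 'link2', 'link3']

# Iterating over the list directly visits every link:
visited = [link for link in links]
print(visited)  # ['link0', 'link1', 'link2', 'link3']
```

Unless skipping the first link is intentional, `for single_event in alllinks:` (as in the answer's code) is both shorter and correct.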