Python: Scrapy doesn't extract data from specific fields on the same page from which I have already successfully extracted other data
Tags: python, python-2.7, scrapy, scrapy-spider

I am new to Scrapy, and I can't work out why I am not getting the information I want. I am running Scrapy against www.kayak.com, and I want to extract the check-in and check-out times for all hotels in New York. I have successfully extracted other data from the same page that holds the check-in and check-out times, but I cannot extract the data for these two fields.

My code is as follows:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from hotel_crawl.items import HotelCrawlItem
from bs4 import BeautifulSoup
import time
import urlparse

class MySpider(CrawlSpider):
    name = "kayaksite"
    allowed_domains = ["www.kayak.com"]
    start_urls = ["http://www.kayak.com/New-York-Hotels.15830.hotel.ksp"]
    rules = (
        Rule(LinkExtractor(
            restrict_xpaths=("//a[@class='actionlink pagenumber'][contains(text(),'Next')]", )), callback="parse_item", follow=True),
    )

    def parse_start_url(self, response):
        print "test"
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = HotelCrawlItem()
        name = response.xpath("//a[@class='hotelname hotelresultsname']//text()").extract()
        price = [BeautifulSoup(i).get_text() for i in response.xpath("//div[@class='pricerange']").extract()]
        review = response.xpath("//a[@class='reviewsoverview']/strong/text()").extract()
        url = response.xpath("//a[@class='hotelname hotelresultsname']//@href").extract()
        alldata = zip(name, price, review, url)
        for i in alldata:
            item['name'] = i[0]
            item['price'] = i[1]
            item['review'] = i[2]
            request = scrapy.Request(urlparse.urljoin(response.url, i[3]), callback=self.parse_item2)
            request.meta['item'] = item
            yield request

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = HotelCrawlItem()
        name = response.xpath("//a[@class='hotelname hotelresultsname']//text()").extract()
        price = [BeautifulSoup(i).get_text() for i in response.xpath("//div[@class='pricerange']").extract()]
        review = response.xpath("//a[@class='reviewsoverview']/strong/text()").extract()
        url = response.xpath("//a[@class='hotelname hotelresultsname']//@href").extract()
        alldata = zip(name, price, review, url)
        for i in alldata:
            item['name'] = i[0]
            item['price'] = i[1]
            item['review'] = i[2]
            request = scrapy.Request(urlparse.urljoin(response.url, i[3]), callback=self.parse_item2)
            request.meta['item'] = item
            yield request

    def parse_item2(self, response):
        print "test--------------"
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = response.meta['item']
        item['location'] = response.xpath("//*[@id='detailsOverviewContactInfo']/div/span/span[1]/text()").extract()
        item['postcode'] = response.xpath("//*[@id='detailsOverviewContactInfo']/div/span/span[3]/text()").extract()
        item['check_in'] = response.xpath("//*[@id='goodToKnow']/div/div[2]/div[2]/text()").extract()
        item['check_out'] = response.xpath("//*[@id='goodToKnow']/div/div[2]/div[2]/text()").extract()
        yield item
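
A side note on the spider above, separate from the check-in/check-out problem: both listing callbacks create a single HotelCrawlItem before the loop and attach that same object to every request. Because the detail callbacks run later, every request may end up seeing the values written for the last hotel. The sketch below reproduces the effect with plain dictionaries instead of Scrapy objects (the function names and sample rows are made up for illustration, and it is written for Python 3), and shows that building a fresh item per iteration avoids it.

```python
def schedule_shared(rows):
    """Mimics the spider: one item object reused for every request."""
    item = {}
    pending = []
    for name, price in rows:
        item["name"] = name
        item["price"] = price
        # Every "request" stores a reference to the same dict.
        pending.append({"meta": {"item": item}})
    return pending


def schedule_fresh(rows):
    """Creates a new item for each request, so nothing is overwritten."""
    pending = []
    for name, price in rows:
        item = {"name": name, "price": price}
        pending.append({"meta": {"item": item}})
    return pending


# Hypothetical sample data, not taken from the real site.
rows = [("Hotel A", "$100"), ("Hotel B", "$200")]

shared = [r["meta"]["item"]["name"] for r in schedule_shared(rows)]
fresh = [r["meta"]["item"]["name"] for r in schedule_fresh(rows)]

print(shared)  # ['Hotel B', 'Hotel B']
print(fresh)   # ['Hotel A', 'Hotel B']
```

In the spider, the equivalent change would be to move the `item = HotelCrawlItem()` line inside the `for` loop.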

Your check-in and check-out XPaths are not returning any values. For the other attributes, such as location and postcode, your XPaths work fine. Also, there are two check-in and check-out data points on that page. Please try the following XPath for check-in, which sits below the price: //input[@name='checkin_date']/@value

Oh, I am trying to get the check-in and check-out times from the "Good to Know" section, which you can see below the "Amenities" section, not from the input fields.
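
Since the thread does not show the markup of the "Good to Know" section, the following is only a sketch of one possible approach: instead of relying on positions such as div[2]/div[2], look up each value by its label. The HTML in SAMPLE is entirely hypothetical, as are the times in it and the helper name extract_facts. It uses the standard library's ElementTree so that it runs on its own under Python 3; ElementTree needs well-formed markup, so on the real page the same idea would have to be expressed with Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real structure of the page is not shown in the thread.
SAMPLE = """<html><body>
<div id="goodToKnow">
  <div>
    <div><div>Check-in</div><div>3:00 PM</div></div>
    <div><div>Check-out</div><div>12:00 PM</div></div>
  </div>
</div>
</body></html>"""


def extract_facts(section):
    """Collect label/value pairs from rows that hold exactly two leaf cells."""
    facts = {}
    for row in section.iter("div"):
        cells = list(row)
        # A "row" here is a div with two children that have no children themselves.
        if len(cells) == 2 and all(len(cell) == 0 for cell in cells):
            label = (cells[0].text or "").strip()
            value = (cells[1].text or "").strip()
            if label:
                facts[label] = value
    return facts


root = ET.fromstring(SAMPLE)
section = root.find(".//*[@id='goodToKnow']")
facts = extract_facts(section)

print(facts.get("Check-in"))   # 3:00 PM
print(facts.get("Check-out"))  # 12:00 PM
```

Looking values up by label also gives check-in and check-out different results, whereas the two XPaths in the question are identical and would return the same text even if they matched. If the section cannot be found at all in the response that Scrapy receives, it is worth checking whether that part of the page is present in the raw HTML.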