Python Scrapy gives me an incomplete link, and I need to parse the internal page
So, technically speaking, Scrapy is giving me the right information when I tell it to scrape:
link = row.xpath('.//p/a/@href').extract_first()
The problem is that I get "/biz/polkadog bakery boston?osq=Dog", exactly as it appears in the HTML code (see 1 in the image), but what I want is the full URL (2 in the image), which only shows up when I hover over the link.
I want that full URL so I can parse the information on the internal page.
I tried searching for something like this, but had no luck.
If I haven't been clear enough, please let me know before giving me a bad rating.
Thanks.
Here is the full spider:
from scrapy import Spider
from yelp.items import YelpItem
import scrapy
import re

class YelpSpider(Spider):
    name = "yelp"
    allowed_domains = ['www.yelp.com']
    # Defining the list of pages to scrape
    start_urls = ["https://www.yelp.com/search?find_desc=Dog&find_loc=Boston%2C%20MA&start=" + str(10 * i) for i in range(0, 1)]

    def parse(self, response):
        # Defining rows to be scraped
        rows = response.xpath('//*[@id="wrap"]/div[3]/div[2]/div[2]/div/div[1]/div[1]/div/ul/li')
        for row in rows:
            # Scraping business name
            name = row.xpath('.//p/a/text()').extract_first()
            # Scraping phone number
            phone = row.xpath('.//div[1]/p[1][@class= "lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--right__373c0__3ARv7"]/text()').extract_first()
            # Scraping area
            area = row.xpath('.//p/span[@class = "lemon--span__373c0__3997G"]/text()').extract_first()
            # Scraping services they offer
            services = row.xpath('.//a[@class="lemon--a__373c0__IEZFH link__373c0__29943 link-color--inherit__373c0__15ymx link-size--default__373c0__1skgq"]/text()').extract_first()
            # Extracting internal link
            link = row.xpath('.//p/a/@href').extract_first()

            item = YelpItem()
            item['name'] = name
            item['phone'] = phone
            item['area'] = area
            item['services'] = services
            item['link'] = link
            yield item

    def parse_detail(self, response):
        item = response.meta['item']
        address = response.xpath('.//*[@id="wrap"]/div[2]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[1]/div/strong/address/text()[1]').extract_first()
        item['address'] = address
        yield item
You need to use response.urljoin():
link = row.xpath('.//p/a/@href').extract_first()
link = response.urljoin(link)
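To illustrate what urljoin does here, Scrapy's response.urljoin() resolves a relative href against the URL of the page the response came from, essentially delegating to the standard library's urllib.parse.urljoin. A minimal stdlib-only sketch (the hyphenated slug in the example path is an assumption; the original shows it with spaces):

```python
from urllib.parse import urljoin

# The search page the spider is currently parsing
page_url = "https://www.yelp.com/search?find_desc=Dog&find_loc=Boston%2C%20MA&start=0"

# The relative href extracted by the XPath (hyphenated slug assumed for illustration)
relative_link = "/biz/polkadog-bakery-boston?osq=Dog"

# Because the href starts with "/", it replaces the path of the base URL,
# keeping only the scheme and host — which is what the browser shows on hover.
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)  # https://www.yelp.com/biz/polkadog-bakery-boston?osq=Dog
```

Since the spider already defines parse_detail and reads response.meta['item'], one way to actually reach the internal page (a sketch, not from the original answer) is to follow the resolved link instead of yielding the item directly, e.g. yield scrapy.Request(link, callback=self.parse_detail, meta={'item': item}).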