Python: How do I use Scrapy to parse two different parts of the page source and merge the results?
I have two spiders that I currently run to scrape a single page: one for the header and one for the details, shown below. I set it up this way because I don't know how to structure the start of the query (here, the variable named listings) so that I can scrape //div[@class='patio-head'] first and then //div[@class='patio-details'] in a single step. Can anyone help? I would like to return the name for each URL along with all of its corresponding details on one row. Thanks.

Header
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from PatioDetail.items import PatioItem

class MySpider(BaseSpider):
    name = "PDSHeader"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["patios.blogto.com"]
    start_urls = ["http://patios.blogto.com/patio/25-liberty-toronto/",
                  "http://patios.blogto.com/patio/3030-dundas-west-toronto/",
                  "http://patios.blogto.com/patio/3-speed/",
                  "http://patios.blogto.com/patio/7numbers/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//div[@class='patio-head']")
        items = []
        # use a distinct loop variable instead of shadowing the list
        for listing in listings:
            item = PatioItem()
            item["Name"] = listing.select("div[@class='patio-head-details']/div[@class='patio-name']/h2[@class='name']/text()").extract()
            items.append(item)
        return items
Details
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from PatioDetail.items import PatioItem

class MySpider(BaseSpider):
    name = "PDSDetails"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["patios.blogto.com"]
    start_urls = ["http://patios.blogto.com/patio/25-liberty-toronto/",
                  "http://patios.blogto.com/patio/3030-dundas-west-toronto/",
                  "http://patios.blogto.com/patio/3-speed/",
                  "http://patios.blogto.com/patio/7numbers/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select("//div[@class='patio-details']")
        items = []
        # use a distinct loop variable instead of shadowing the list
        for listing in listings:
            item = PatioItem()
            item["Type"] = listing.select("ul[@class='detail-lister']/li[@class='type-icon']/div[@class='detail-line']/span[@class='detail-desc']/text()").extract()
            item["Covered"] = listing.select("ul[@class='detail-lister']/li[@class='covered-icon']/div[@class='detail-line']/span[@class='detail-desc']/text()").extract()
            item["Heated"] = listing.select("ul[@class='detail-lister']/li[@class='heated-icon']/div[@class='detail-line']/span[@class='detail-desc']/text()").extract()
            item["Capacity"] = listing.select("ul[@class='detail-lister']/li[@class='capacity-icon last']/div[@class='detail-line']/span[@class='detail-desc']/text()").extract()
            items.append(item)
        return items
The two sections you want are on the same page. All you need to do is fetch the page once and parse it for data from both sections, rather than fetching it twice and parsing it twice.
Before writing a spider, it is worth spending some time analyzing the structure of the page you want to scrape. A code example:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item = PatioItem()
    item['Name'] = hxs.select("//div[@class='patio-name']/h2/text()").extract()[0]
    node_type = hxs.select("//ul[@class='detail-lister']/li[@class='type-icon']")
    item['Type'] = node_type.select(".//span[@class='detail-desc']/text()").extract()[0]
    node_covered = hxs.select("//ul[@class='detail-lister']/li[@class='covered-icon']")
    item['Covered'] = node_covered.select(".//span[@class='detail-desc']/text()").extract()[0]
    node_heated = hxs.select("//ul[@class='detail-lister']/li[@class='heated-icon']")
    item['Heated'] = node_heated.select(".//span[@class='detail-desc']/text()").extract()[0]
    node_capacity = hxs.select("//ul[@class='detail-lister']/li[@class='capacity-icon last']")
    item['Capacity'] = node_capacity.select(".//span[@class='detail-desc']/text()").extract()[0]
    return [item]
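For comparison, the one-pass idea (query two sections of the same parsed document and merge the results into one item) can be sketched with nothing but the standard library. The markup below is a simplified stand-in for the real page, not the actual BlogTO HTML, and ElementTree's limited XPath support takes the place of Scrapy's selectors:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in markup: one "head" section and one "details" section,
# mirroring the patio-head / patio-details split on the real page.
HTML = """
<html><body>
  <div class="patio-head">
    <div class="patio-name"><h2 class="name">25 Liberty</h2></div>
  </div>
  <div class="patio-details">
    <ul class="detail-lister">
      <li class="type-icon"><span class="detail-desc">Restaurant</span></li>
      <li class="covered-icon"><span class="detail-desc">Partially</span></li>
    </ul>
  </div>
</body></html>
"""

def parse_page(source):
    root = ET.fromstring(source)
    item = {}
    # First section: the name, from the head block
    item["Name"] = root.findtext(".//div[@class='patio-name']/h2")
    # Second section, in the same pass: each detail row, keyed by its class
    for li in root.findall(".//ul[@class='detail-lister']/li"):
        key = li.get("class").split("-")[0].capitalize()
        item[key] = li.findtext("span[@class='detail-desc']")
    return item

print(parse_page(HTML))
# {'Name': '25 Liberty', 'Type': 'Restaurant', 'Covered': 'Partially'}
```

The point is the same as in the answer above: the document is fetched and parsed once, and both sections are queried from the same parsed tree before the merged item is returned.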
Here is a tutorial on XPath. It will do you a favor :)

I got the first part of the code working, but could you explain how you use a temporary dict to store the data before passing it to the item? I am new to both Python and Scrapy, so I am still learning the basics. Thanks for the code example and the link to the XPath tutorial. I have actually read that tutorial before, but I find it hard to go from its simple examples to a more complex real-world problem. When I try to plug in the code I get the following error:

File "PatioDetail\spiders\Details.py", line 17, in parse
    item['Type'] = node_type.xpath(".//span[@class='detail-desc']/text()").extract()[0]
exceptions.AttributeError: 'XPathSelectorList' object has no attribute 'xpath'

Any idea what I am doing wrong?

@JillAtkins, sorry, I mixed up the selectors from scrapy and lxml. I have corrected my mistake.

Thank you so much, it works perfectly.
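The commenter's AttributeError has a shape worth recognizing: a query returns a *list* of nodes, and calling a node-level method on the list itself fails, just as the old XPathSelectorList had no .xpath() method. A minimal standard-library reproduction (the markup is hypothetical, not the real page):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<ul class='detail-lister'>"
    "<li class='type-icon'><span class='detail-desc'>Restaurant</span></li>"
    "</ul>"
)
nodes = doc.findall("li")  # a plain Python list of elements

try:
    nodes.find("span")  # wrong: the list has no element-level .find()
except AttributeError as exc:
    print(exc)  # 'list' object has no attribute 'find'

print(nodes[0].findtext("span"))  # right: index into the list first
```

The fix is the same in both worlds: index into (or loop over) the result list before calling node-level methods on an individual node.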