Python: scraping with XPath and DOM Document
I am learning how to scrape data from websites, but I am stuck on this one. I can't post the link here for privacy reasons, but I'll do my best to explain.
Hotel rating 1:
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
Hotel rating 2:
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
Hotel rating 3:
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
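</div>
There are 100 hotels like this, each with a different level class, so I can't use XPath, or maybe I just don't know XPath well enough.
I want to scrape all the ratings of the restaurants, i.e. "3.5", "3.9", "4.2", but the problem is that each rating has a different level and a different id.
I am just a beginner and I want to learn, so can anyone tell me how I can get the hotel ratings?
An example would be great.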
Use the lxml library. This will return a list of all the divs that contain ratings:
import urllib2
from lxml import etree
html = urllib2.urlopen(url)
html_text = etree.HTML(html.read())
rating_list = html_text.xpath('//*[@class="right"]/div')
#rating_lst = html_text.xpath('//*[@class="right"]') # choose accordingly, I dont have full source-code so commented out
for rate in rating_list:
    print rate.xpath('text()')
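Code for the given sample data:
import urllib2
from lxml import etree
data = """
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
"""
#html = urllib2.urlopen(url) # use these two lines if fetching the source from a url
#html_text = etree.HTML(html.read())
html_text = etree.HTML(data)
rating_list = html_text.xpath('//*[@class="right"]/div')
for rate in rating_list:
    print rate.xpath('text()')[0].strip('\n\t')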
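The XPath above relies on the wrapper div having the class "right". Since every rating div in the question also carries the rating-div class, a variant that matches that class token directly should not care about the per-hotel level-N and rating-for-N classes. This is only a sketch: it assumes Python 3 with lxml installed, the sample markup is copied from the question, and the names rating_xpath and ratings are illustrative.

```python
from lxml import etree

# Sample markup copied from the question (three hotels).
data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""

tree = etree.HTML(data)

# Match any div whose class attribute contains the token "rating-div".
# Padding both sides with spaces avoids matching longer class names
# that merely start or end with "rating-div".
rating_xpath = (
    '//div[contains(concat(" ", normalize-space(@class), " "), " rating-div ")]'
)

# .text is the text directly inside each matched div; strip the newlines.
ratings = [div.text.strip() for div in tree.xpath(rating_xpath)]
print(ratings)
```

Because the expression ignores the id and level parts of the class attribute, the same line should work for all 100 hotels as long as each rating div keeps the rating-div class.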
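You should use an HTML parser. There are multiple options, but BeautifulSoup is one of the easiest to use and understand. Here is an example that gets the text of the div elements that have the rating-div class: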
from bs4 import BeautifulSoup
data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""
soup = BeautifulSoup(data)
print [r.get_text(strip=True) for r in soup.find_all('div', attrs={'class': 'rating-div'})]
This prints:
[u'3.5', u'3.9', u'4.2']
How can I store this array of ratings in a variable? With XPath I do it like this: $addressQuery = $xpath->query("//span[@class='search-result-address']/text()"); Can you help me with this, Gaurav?
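On storing the ratings in a variable: the list comprehension already produces an ordinary Python list, so it can be assigned to a name instead of printed. A minimal sketch, assuming Python 3 with the bs4 package installed; the sample markup is copied from the question, and the variable names ratings and rating_values are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (three hotels).
data = """
<div>
<div class = "right">
<div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
3.5
</div>
</div>
<div class = "right">
<div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
3.9
</div>
</div>
<div class = "right">
<div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
4.2
</div>
</div>
</div>
"""

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(data, "html.parser")

# Keep the rating strings in a variable instead of printing them directly.
ratings = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="rating-div")
]

# Convert to floats if the values need to be sorted or averaged later.
rating_values = [float(r) for r in ratings]

print(ratings)
print(rating_values)
```

From here ratings can be used like any other list, for example ratings[0] for the first hotel or len(ratings) for the number of ratings found.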