Python: scraping with XPath and a DOM document


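I am learning how to scrape data from websites, and I am stuck on this one. For privacy reasons I can't post the link here, but I will do my best to explain.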

Rating of hotel 1:

<div class = "right">
    <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
                                           3.5
                 </div>

Rating of hotel 2:

<div class = "right">
    <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
                                           3.9
                 </div>

Rating of hotel 3:

<div class = "right">
    <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
                                           4.2
                 </div>

There are 100 hotels like this, each with a different level, so I can't use XPath, or perhaps I just don't know enough about XPath.

I want to scrape all of the restaurants' ratings, i.e. "3.5", "3.9", "4.2", but the problem is that each rating has a different level and a different id.

Please bear in mind that I am only a beginner and I want to learn. Can anyone tell me how to get the hotels' ratings? An example would be great.

Use lxml. This returns a list of all the div elements that contain a rating:

from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from lxml import etree

# `url` must hold the address of the page to scrape (not given in the question)
html = urlopen(url)
html_text = etree.HTML(html.read())
rating_list = html_text.xpath('//*[@class="right"]/div')
# rating_list = html_text.xpath('//*[@class="right"]')  # choose accordingly; I don't have the full source, so this is commented out

for rate in rating_list:
    print(rate.xpath('text()'))
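Code for the given sample data:

from lxml import etree

data = """
<div class = "right">
    <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
        3.5
    </div>
</div>
<div class = "right">
    <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
        3.9
    </div>
</div>
<div class = "right">
    <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
        4.2
    </div>
</div>
"""
# html = urlopen(url)  # use these two lines if fetching the source from a URL
# html_text = etree.HTML(html.read())
html_text = etree.HTML(data)
rating_list = html_text.xpath('//*[@class="right"]/div')
for rate in rating_list:
    print(rate.xpath('text()')[0].strip())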
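
The question's worry is that every rating div carries a different id and a different level-N class. In the three samples shown, however, all of them share the rating-div class, so an XPath expression can select on that one token instead. The following is only a sketch under that assumption: SAMPLE is the markup from the question with the outer divs closed, and extract_ratings is a hypothetical helper name, not something from the original answer.

```python
from lxml import etree

# Sample markup from the question (outer wrappers closed so it is well formed).
SAMPLE = """
<div>
  <div class = "right">
    <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
      3.5
    </div>
  </div>
  <div class = "right">
    <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
      3.9
    </div>
  </div>
  <div class = "right">
    <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
      4.2
    </div>
  </div>
</div>
"""


def extract_ratings(html_source):
    """Return the text of every div whose class list includes 'rating-div'."""
    tree = etree.HTML(html_source)
    # Pad @class with spaces so 'rating-div' only matches as a whole class
    # token, not as a substring of some longer class name.
    xpath = '//div[contains(concat(" ", normalize-space(@class), " "), " rating-div ")]'
    return [(div.text or "").strip() for div in tree.xpath(xpath)]


ratings = extract_ratings(SAMPLE)
print(ratings)
```

Because the expression ignores the id and the level-N class entirely, it should keep working for all 100 hotels as long as the real page uses the same rating-div class that the samples do.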

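You should use an HTML parser. There are several options, but BeautifulSoup is one of the easiest to use and understand. Here is an example that gets the text of the div elements that have the rating-div class: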

from bs4 import BeautifulSoup

data = """
<div>
    <div class = "right">
        <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
                                               3.5
                     </div>
    </div>
    <div class = "right">
        <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
                                               3.9
                     </div>
    </div>
    <div class = "right">
        <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
                                               4.2
                     </div>
    </div>
</div>
"""

soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print([r.get_text(strip=True) for r in soup.find_all('div', attrs={'class': 'rating-div'})])

How can I store this array of ratings in a variable? With XPath I do it like this: $addressQuery = $xpath->query("//span[@class='search-result-address']/text()"); Can you help me with this, Gaurav?
from bs4 import BeautifulSoup

# `data` is the same sample HTML string as in the answer above
soup = BeautifulSoup(data, 'html.parser')
ratings = [r.get_text(strip=True) for r in soup.find_all('div', attrs={'class': 'rating-div'})]
print(ratings)

This prints:

['3.5', '3.9', '4.2']
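
As a follow-up to the comment about storing the ratings, here is a hedged sketch that keeps each rating as a number, keyed by the data-res-id attribute from the sample markup. It uses BeautifulSoup's CSS selector support (select) instead of find_all. The names SAMPLE and ratings_by_id are illustrative assumptions, and the real page may of course differ from the three samples in the question.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (outer wrappers closed so it is well formed).
SAMPLE = """
<div>
  <div class = "right">
    <div data-res-id = "305281" class = "tooltip rating-for-305281 rating-div left res-snippet-small-rating level-6">
      3.5
    </div>
  </div>
  <div class = "right">
    <div data-res-id = "8913" class = "tooltip rating-for-8913 rating-div left res-snippet-small-rating level-7">
      3.9
    </div>
  </div>
  <div class = "right">
    <div data-res-id = "4959" class = "tooltip rating-for-4959 rating-div left res-snippet-small-rating level-8">
      4.2
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Map each restaurant id to its rating as a float, so the values can be
# sorted or averaged later instead of being handled as strings.
ratings_by_id = {
    div["data-res-id"]: float(div.get_text(strip=True))
    for div in soup.select("div.rating-div")
}

print(ratings_by_id)
```

A dictionary is only one option. If the ids are not needed, a plain list of floats built from the same selector would serve just as well.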