Parsing hidden elements in static HTML files with Python lxml

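I have a set of static HTML files from which I need to parse and extract some details. I am using the Python lxml module to get them. A sample of the static HTML looks like this: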

<div class="top">
<a data-bind="text">abc</a>
<span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
    </span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
    </span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
    <span></span>
    <span class="locality" data-bind="text: hotel.pob"></span>
</div>
</div>

<div class="top">
<a data-bind="text">dfg</a>
<span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
    </span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
    </span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'" style="display: none;"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
    <span></span>
    <span class="locality" data-bind="text: hotel.pob"></span>
</div>
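
However, my code does not give me the result I want. What am I doing wrong?

Expected output:
abc 4
dfg 0

There are several ways to approach this. Here is one: get the "star" rating elements and return the index of the first "visible" one, falling back to 0 if none is visible. We can use enumerate() and next() to achieve this: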

from lxml.html import fromstring


def is_visible(element):
    """Naive implementation of the element visibility check."""
    return 'display: none;' not in element.attrib.get("style", "")


def get_rating(entry):
    """Return the 1-based position of the first visible star element, or 0."""
    rating_elements = entry.xpath(".//span[contains(@class, 'star')]")
    visible_rating = (index
                      for index, element in enumerate(rating_elements, start=1)
                      if is_visible(element))
    return next(visible_rating, 0)


# html is a string holding the sample markup from the question.
root = fromstring(html)
for sali in root.xpath('//div[@class="top"]'):
    for x in sali.xpath('a'):
        print(x.text, get_rating(sali))

Prints (under Python 2; Python 3 prints abc 4 and dfg 0 without the parentheses and quotes):

('abc', 4)
('dfg', 0)
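
The is_visible() helper above only recognizes the exact substring display: none; and would miss equivalent spellings such as display:none or DISPLAY: NONE. Below is a minimal sketch of a more tolerant check. The names parse_style and style_hides are illustrative and not part of lxml, and the function takes the raw style string (for example element.get("style", "")) rather than an element. It only looks at the inline style, not at stylesheets or hidden ancestors.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" not in chunk:
            continue  # skip empty or malformed declarations
        prop, _, value = chunk.partition(":")
        declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def style_hides(style):
    """Return True if the inline style sets display to none."""
    value = parse_style(style).get("display", "")
    # Ignore a trailing "!important" marker, if present.
    return value.replace("!important", "").strip() == "none"


for sample in ("display: none;", "display:none", "DISPLAY: NONE", "color: red", ""):
    print(repr(sample), style_hides(sample))
```

With this in place, is_visible(element) could be written as not style_hides(element.get("style", "")).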

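Note that class is a multi-valued attribute, so strictly speaking contains() is not the best tool for finding elements by class value. contains(@class, 'star') is a substring test, so it would also match an unrelated class such as starred.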
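
One common XPath 1.0 idiom for matching a single class token is to pad the normalized class value with spaces and search for the padded name. The sketch below compares it with a plain contains() test on a made-up fragment. The sample markup and variable names are assumptions for illustration only.

```python
from lxml.html import fromstring

fragment = fromstring(
    '<div>'
    '<span class="star star4"></span>'
    '<span class="starred"></span>'
    '<span class="big  star"></span>'
    '</div>'
)

# Substring test: also matches class="starred".
loose = fragment.xpath(".//span[contains(@class, 'star')]")

# Token test: matches only elements whose class list contains "star".
strict = fragment.xpath(
    ".//span[contains(concat(' ', normalize-space(@class), ' '), ' star ')]"
)

print(len(loose), len(strict))
```

The loose query returns all three spans, while the strict one skips the span whose only class is starred.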


    • You can use lxml through BeautifulSoup. Someone more familiar with Python could probably tidy this up.

      from bs4 import BeautifulSoup
      
      html = '''
      <div class="top">
      <a data-bind="text">abc</a>
      <span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
          </span>
      <span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
          </span>
      <span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'"></span>
      <span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
      <div class="adr">
          <span></span>
          <span class="locality" data-bind="text: hotel.pob"></span>
      </div>
      </div>
      
      <div class="top">
      <a data-bind="text">dfg</a>
      <span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
          </span>
      <span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
          </span>
      <span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'" style="display: none;"></span>
      <span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
      <div class="adr">
          <span></span>
          <span class="locality" data-bind="text: hotel.pob"></span>
      </div>
      '''
      
      soup = BeautifulSoup(html, 'lxml')
      ratings = []
      for item in soup.select("div.top"):
          hotel = item.select_one('a').text
          rating = '0'  # default when every star element is hidden
          for star in item.select("[data-bind*='visible:hotel.cat']"):
              # A star element without a style attribute is the visible one.
              if 'style' not in star.attrs:
                  # data-bind looks like "visible:hotel.cat === '4'"
                  rating = star['data-bind'].split("'")[1]
                  break
          ratings.append(hotel + ' ' + rating)
      print(ratings)
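
When extracting the rating from a data-bind value, keep in mind that str.strip() treats its argument as a set of characters to remove from both ends, not as a prefix. A regular expression is a more predictable way to pull out the number. The pattern and the extract_rating name below are illustrative assumptions based on the sample markup.

```python
import re

# Matches e.g. "visible:hotel.cat === '4'" and captures the digits.
RATING_RE = re.compile(r"hotel\.cat\s*===\s*'(\d+)'")


def extract_rating(data_bind):
    """Return the star rating encoded in a data-bind value, or None."""
    match = RATING_RE.search(data_bind)
    return int(match.group(1)) if match else None


print(extract_rating("visible:hotel.cat === '4'"))   # 4
print(extract_rating("visible:hotel.marca1===''"))   # None

# strip() removes any of the given characters from both ends:
print("catalog".strip("cat"))                        # log
```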
      

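Thanks. If I use tree.cssselect(star.sprite.disponibilidad) instead of "contains", would that be a better approach?
@justjoy Yes, definitely a better option, give it a try.
Also, will creating two functions affect the overall run time of the script?
@justjoy In general, function calls do have a cost. However, you probably don't need to worry about the extra function calls unless you make a very large number of them, execution time really matters, and you have already dealt with all the other bottlenecks (in this case, the HTML parsing and the tree traversal done by XPath) :)
Thanks, I will try the solution and get back here.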