Python: How to select child elements in a selector

python beautifulsoup lxml scrapy

I am using HtmlXPathSelector to parse HTML content. The target site wraps its content in an unpredictable set of HTML tags. For example, the markup might be:

<div class="doctor_ans">
  <h3>Title</h3>
  <p style="text-align: justify;">
    <span style="font-size: 12px;">
      <span style="font-family: arial,helvetica,sans-serif;">
        <font color="#000000">I would like to get contain here.</font>
      </span>
    </span>
  </p>    
</div>


or the same title and text may appear inside a different nesting of tags, and so on.

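Please advise me on how to parse this content. The HTML tags appear at random, so I need a way to descend through the child elements until I reach the innermost one.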

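I have more experience with Selenium, but the XPath part should be the same. Use xpath='.//span' to select the child elements, then read each element's .text. If a child element is empty, discard it and move on to the next one.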
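The "skip empty elements and move on" idea can be sketched without Selenium. The following is only an illustration using the standard library's `xml.etree.ElementTree`, and it assumes the fragment is well-formed enough to parse as XML (real pages usually need an HTML parser such as lxml or Scrapy's selectors). The helper name `first_non_empty_text` and the `SAMPLE` markup are hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the markup in the question.
SAMPLE = """
<p>
  <span style="font-size: 12px;">
    <span style="font-family: arial,helvetica,sans-serif;">
      <font color="#000000">I would like to get contain here.</font>
    </span>
  </span>
</p>
"""

def first_non_empty_text(element):
    """Return the text of the first element (in document order) that has any.

    Elements whose own text is missing or whitespace-only are skipped,
    which mirrors "if the child is empty, move on to the next one".
    """
    for node in element.iter():  # the element itself, then all descendants
        text = (node.text or "").strip()
        if text:
            return text
    return None

paragraph = ET.fromstring(SAMPLE)
print(first_non_empty_text(paragraph))
```

Because the loop walks every descendant rather than a fixed tag name, it does not matter whether the text sits in a `span`, a `font`, or directly in the `p`.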

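from scrapy.selector import HtmlXPathSelector  # legacy selector API from older Scrapy releases
hxs = HtmlXPathSelector(response)
# The leading // searches the whole document; //text() collects every text node under the first <p>.
hxs.select('//div[@class="doctor_ans"]/p[1]//text()').extract()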

will give you a list of the individual text nodes in the first paragraph of the doctor_ans div. What exactly is the problem? What have you tried?
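To see why collecting all descendant text is insensitive to the nesting, here is a small self-contained sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in, because ElementTree's limited XPath does not support `//text()`; `itertext()` plays that role here. It assumes the fragments parse as XML, and the names `first_paragraph_text`, `VARIANT_A` and `VARIANT_B` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical variants of the same content with different nesting.
VARIANT_A = """
<div class="doctor_ans">
  <h3>Title</h3>
  <p>
    <span style="font-size: 12px;">
      <span style="font-family: arial,helvetica,sans-serif;">
        <font color="#000000">I would like to get contain here.</font>
      </span>
    </span>
  </p>
</div>
"""

VARIANT_B = """
<div class="doctor_ans">
  <h3>Title</h3>
  <p>
    <span style="font-size: 12px;">
        I would like to get contain here.
    </span>
  </p>
</div>
"""

def first_paragraph_text(markup):
    """Join all non-blank text found under the first <p> of div.doctor_ans."""
    # Wrap the fragment so the div is a descendant of the root and
    # the './/div' search can reach it.
    root = ET.fromstring("<body>" + markup + "</body>")
    paragraph = root.find(".//div[@class='doctor_ans']/p[1]")
    if paragraph is None:
        return ""
    # itertext() yields every text node below the paragraph in document order.
    pieces = (text.strip() for text in paragraph.itertext())
    return " ".join(piece for piece in pieces if piece)

for variant in (VARIANT_A, VARIANT_B):
    print(first_paragraph_text(variant))
```

Both variants print the same sentence, since whitespace-only text nodes between the wrapper tags are filtered out before joining.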

Example markup variants:

<div class="doctor_ans">
  <h3>Title</h3>
  <p>
    <span style="font-size: 12px;">
      <span style="font-family: arial,helvetica,sans-serif;">
        <font color="#000000">I would like to get contain here.</font>
      </span>
    </span>
  </p>    
</div>

<div class="doctor_ans">
  <h3>Title</h3>
  <p>
    <span style="font-size: 12px;">
        I would like to get contain here.
    </span>
  </p>    
</div>
