Python: Extracting the HTML from an element/node (Python, XPath, Scrapy)


python xpath scrapy


Suppose you have the following HTML string:

<div class="content">
   This is some test <b>this is bold </b> this is great list of text.
</div>
<div class="content">
   <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
   </ul>
</div>

How can I get the entire nested HTML of both elements/nodes as a string in a variable?

If speed is not a concern, this is easy to do with BeautifulSoup.

You can use /node() at the end of the XPath (see the answer to a similar question):

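# Returns all child nodes: text nodes as well as elements.
# `product` is a Scrapy selector. Newer Scrapy versions name this method
# xpath(); select() is the older, deprecated spelling of the same call.
contents = product.xpath('//div[@class="content"]/node()').extract()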

Note that extract() returns a list, which you can join in the usual way to recover the HTML:

html = "\n".join(contents)
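To see what the node()-and-join approach yields without a Scrapy install, here is a minimal standard-library sketch of the same idea. It is not Scrapy's API: it swaps in xml.etree.ElementTree, assumes the markup is well-formed enough to parse as XML, and wraps the snippet in a hypothetical <root> element because the sample has two top-level divs. The helper name inner_html is made up for illustration.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="content">
   This is some test <b>this is bold </b> this is great list of text.
</div>
<div class="content">
   <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
   </ul>
</div>
"""


def inner_html(element):
    """Serialize all child nodes of `element` (text and elements) to a string."""
    # Leading text that appears before the first child element, if any.
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the text that follows the child (its "tail"),
        # so text between sibling elements is preserved.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Hypothetical wrapper element: ElementTree needs a single root node.
root = ET.fromstring("<root>" + HTML + "</root>")

# One string per matching div, each holding that div's nested markup.
contents = [inner_html(div) for div in root.findall(".//div[@class='content']")]
```

Each entry in contents holds the nested markup of one div, so contents[0] keeps the <b> tag and contents[1] keeps the whole <ul> list. For real-world HTML that is not well-formed XML, a proper HTML parser (such as the one behind Scrapy selectors) is the safer choice.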

The following XPath

//div[@class="content"]/text()|//div[@class="content"]/b/text()|//div[@class="content"]/ul/li

gives the result, since you only need to store the data of the two elements:

# Newer Scrapy versions name this method xpath(); select() is the deprecated spelling.
contents = product.xpath('//div[@class="content"]/text()|//div[@class="content"]/b/text()|//div[@class="content"]/ul/li').extract()

Now contents holds the data of both elements.

I would like to use native support.
