HTML web scraping: getting everything inside a tag, including other tags

html web-scraping

I have the following markup:

<div class="example">
    <p> text <a href="#"> link </a> text</p>
</div>

This gives me a list of paragraph tags, which I then join together:

description = ' '.join('<p>{0}</p>'.format(paragraph) for paragraph in description)

But surely there must be a way to get the contents of the div directly? Thanks,
Carl

I found a workaround. It isn't pretty, but it gives me what I want:

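dummy = tree.xpath('//div[@class="example"]/div[2]/div/node()')
description = ''
for paragraph in dummy:
    try:
        # encoding="unicode" returns str, so it can be appended to description
        description += html.tostring(paragraph, encoding="unicode")
    except TypeError:
        # node() also yields text nodes, which tostring() cannot serialize
        pass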

You can simply select all the child elements of the tag:

h = """<div class="example">
<p> text <a href="#"> link </a> text</p>
<p> othertext <a href="#"> otherlink </a> text</p>
</div>"""

from lxml import html

x = html.fromstring(h)

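# encoding="unicode" makes tostring() return str instead of bytes, so "".join() works on Python 3
print("".join(html.tostring(n, encoding="unicode") for n in x.xpath("//div[@class='example']/*")))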

No try/except is needed.

The print call outputs:

<p> text <a href="#"> link </a> text</p>
<p> othertext <a href="#"> otherlink </a> text</p>

Alternatively, iterate over the div's children directly:

"".join(html.tostring(n, encoding="unicode") for n in x.xpath("//div[@class='example']")[0].iterchildren())