Python: extracting text from HTML with bold tags, preserving order


I am trying to extract text from an HTML file with the following structure:

<td class='srctext'>
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text, 
    <b> also some bold text </b>
    and the last text
</pre>

By using the string() wrapper I can get the inner text of all the bold tags, and I can also get the entire inner text of the pre element.

However, what I am struggling to get is every piece of text, bold and plain alike, as a separate string, with the pieces kept in their original order.

If anything is unclear, please do not hesitate to ask.

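Try using

//td[@class='srctext']/pre//text()[normalize-space()]

as the XPath (assuming you have full XPath 1.0 support, as in lxml, rather than the restricted XPath subset of ElementTree). The //text() step selects every descendant text node in document order, and the [normalize-space()] predicate drops the nodes that contain only whitespace.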
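If only the standard library's ElementTree is available, the text() step above cannot be used, but itertext() walks the same text pieces in document order. The following is a minimal sketch under two assumptions that are not part of the question: the markup has been made well-formed (ElementTree needs valid XML), and the names snippet and fragments are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed stand-in for the question's markup.
snippet = """<td class="srctext"><pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text,
    <b> also some bold text </b>
    and the last text
</pre></td>"""

root = ET.fromstring(snippet)
pre = root.find("pre")

# itertext() yields each element's text and tail strings in document order;
# skip the whitespace-only pieces and trim the rest.
fragments = [piece.strip() for piece in pre.itertext() if piece.strip()]
print(fragments)
```

Unlike the XPath version, this trims each fragment, so filtering and clean-up happen in one pass.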

A complete example with lxml:

from lxml import etree as ET

html = '''<html><body><table><tr><td class=srctext>
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text, 
    <b> also some bold text </b>
    and the last text
</pre>
</body>
</html>'''

# ET.HTML() uses lxml's lenient HTML parser, so the unclosed tags are tolerated.
htmlEl = ET.HTML(html)
# Select every non-blank text node under <pre>, in document order.
textValues = htmlEl.xpath("//td[@class='srctext']/pre//text()[normalize-space()]")
print(textValues)

It prints:

[' Heading 1 ', '\n    text\n    more text\n    ', ' Heading 2 ', '\n    even more text, \n    ', ' also some bold text ', '\n    and the last text\n']

If I understand your question correctly, you want to ignore the HTML structure and extract the text fragments into a list, where each list element is a string that contains no markup.

Parsing XML or HTML with regular expressions is usually a bad idea, but this problem is one of the rare cases where it is workable. Assuming you have read the whole file into a single string t:

import re

# Capture the text that precedes each tag, then drop the empty pieces.
print([i.strip() for i in re.findall(r'(.*?)<.*?>', t, re.DOTALL) if i.strip()])

This gives:

['Heading 1', 'text\n    more text', 'Heading 2', 'even more text,', 'also some bold text', 'and the last text']
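For comparison, the same list can be built without a regular expression by using the standard library's html.parser module. This is only a sketch of an alternative, not what either answer proposed. The class name TextCollector and the variable sample are made up for illustration, and the sample is a shortened stand-in for the question's file.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects non-blank text fragments in document order."""

    def __init__(self):
        super().__init__()
        self.fragments = []

    def handle_data(self, data):
        # Called for the text found between tags; keep it only if non-blank.
        text = data.strip()
        if text:
            self.fragments.append(text)


# Assumption: a shortened stand-in for the file described in the question.
sample = """<td class="srctext"><pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text,
    <b> also some bold text </b>
    and the last text
</pre></td>"""

collector = TextCollector()
collector.feed(sample)
collector.close()
print(collector.fragments)
```

Note that this collects text from the whole document. Restricting it to the td with class srctext would need extra state in handle_starttag and handle_endtag, which the XPath approach handles more directly.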