Python: extracting text from HTML with bold tags, preserving order

I am trying to extract text from an HTML file with the following structure:

<td class='srctext'>
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text, 
    <b> also some bold text </b>
    and the last text
</pre>

By using the string() wrapper I can get the inner text of all the bold tags, and I can also get the entire inner text of the pre element.
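For illustration, the two string() calls mentioned above presumably look something like the following sketch. The shortened sample markup and the variable names (`whole_text`, `bold_texts`) are assumptions made for this example, not taken from the original question. It shows why neither call is enough on its own: one returns a single merged string, the other only the bold pieces.

```python
from lxml import etree

# Shortened, hypothetical sample in the same shape as the structure above
html = '''<table><tr><td class="srctext">
<pre>
    <b> Heading 1 </b>
    text
    <b> Heading 2 </b>
    more text
</pre>
</td></tr></table>'''

root = etree.HTML(html)
pre = root.xpath("//td[@class='srctext']/pre")[0]

# string() on <pre> concatenates every descendant text node into one string,
# so the boundaries between bold and non-bold text are lost
whole_text = pre.xpath("string()")

# string() on each <b> returns only the bold pieces, without the text between them
bold_texts = [b.xpath("string()") for b in pre.xpath(".//b")]

print(repr(whole_text))
print(bold_texts)
```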

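However, what I am struggling to do is get the bold and non-bold pieces of text as separate items, in the order they appear in the document.

If anything is unclear, please don't hesitate to ask.

Try using
//td[@class='srctext']/pre//text()[normalize-space()]
as the XPath (assuming you have full XPath 1.0 support, as in lxml, rather than the restricted XPath support of ElementTree).

A complete example:

from lxml import etree as ET

html = '''<html><body><table><tr><td class="srctext">
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text, 
    <b> also some bold text </b>
    and the last text
</pre>
</td></tr></table></body>
</html>'''

# Parse with lxml's lenient HTML parser
htmlEl = ET.HTML(html)
# Select every text node below <pre>, skipping whitespace-only nodes
textValues = htmlEl.xpath("//td[@class='srctext']/pre//text()[normalize-space()]")
print(textValues)

This prints:

[' Heading 1 ', '\n    text\n    more text\n    ', ' Heading 2 ', '\n    even more text, \n    ', ' also some bold text ', '\n    and the last text\n']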
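The text nodes returned by that XPath still carry their surrounding whitespace, and the plain list does not say which pieces came from a `<b>` element. A minimal sketch of one way to recover that, relying on lxml's "smart string" results (their `getparent()` method and `is_text` attribute); the helper name `label_pieces` and the "bold"/"plain" labels are hypothetical choices for this example:

```python
from lxml import etree

html = '''<html><body><table><tr><td class="srctext">
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text,
    <b> also some bold text </b>
    and the last text
</pre>
</td></tr></table></body></html>'''

root = etree.HTML(html)
nodes = root.xpath("//td[@class='srctext']/pre//text()[normalize-space()]")


def label_pieces(text_nodes):
    """Pair each stripped text piece with 'bold' or 'plain', keeping document order."""
    pieces = []
    for node in text_nodes:
        parent = node.getparent()
        # Text inside <b> is the .text of a <b> element; text that follows
        # </b> is that element's .tail, so is_text is False for it.
        is_bold = node.is_text and parent.tag == "b"
        pieces.append(("bold" if is_bold else "plain", node.strip()))
    return pieces


pieces = label_pieces(nodes)
for kind, text in pieces:
    print(kind, repr(text))
```

Because XPath returns nodes in document order, the headings and the text between them stay interleaved as they are in the source.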

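If I understand your question correctly, you want to ignore the HTML structure and extract the text fragments into a list, where each list element is a string that contains no tags.

Parsing XML or HTML with regular expressions is usually a bad idea, but this problem is one of the rare cases where it is acceptable. Assuming you have read the whole file into a single string t: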

import re

# t holds the whole file as a single string
[i.strip() for i in re.findall(r'(.*?)<.*?>', t, re.DOTALL) if i.strip()]

This gives:

['Heading 1', 'text\n    more text', 'Heading 2', 'even more text,', 'also some bold text', 'and the last text']
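As a self-contained sketch of the regex approach, the snippet below applies the same pattern to the sample structure from the question. The function name `text_fragments` and the closing `</td>` in the sample are assumptions added for this example. It also shows one limitation to keep in mind: the pattern only captures text that is followed by a tag, so anything after the final tag is dropped.

```python
import re

# Sample markup in the shape given in the question (closing </td> assumed)
t = '''<td class='srctext'>
<pre>
    <b> Heading 1 </b>
    text
    more text
    <b> Heading 2 </b>
    even more text,
    <b> also some bold text </b>
    and the last text
</pre>
</td>'''


def text_fragments(markup):
    """Return the non-empty text runs that precede each tag, in order."""
    # (.*?) lazily captures the text before the next <...> tag;
    # re.DOTALL lets '.' match newlines inside a run.
    runs = re.findall(r'(.*?)<.*?>', markup, re.DOTALL)
    return [run.strip() for run in runs if run.strip()]


fragments = text_fragments(t)
print(fragments)

# Text after the last tag has no tag following it, so it is not captured
print(text_fragments("<b>kept</b> dropped"))
```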