Python: get the HTML element/node/tag at an exact position

python html python-3.x

I have a long HTML document, and I know the exact position of some text in it. For example:

<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>


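I know that the sentence "I know the exact position of this text" starts at character number x and ends at character number y. But I need to get the whole tag/node/element that holds this value, and possibly several of its ancestors.

How can I handle this easily?

Edit:

To be clear, the only thing I get is an integer value that describes where the sentence starts.

For example: 2048.

I can't make any assumptions about the structure of the document. Starting from that point, I have to walk the nodes ancestor by ancestor.

Even the sentence that the position (2048) points to is not necessarily unique.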

Assuming <b> is unique in this example, you can use XPath with xml.etree.ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('xmlfile')
root = tree.getroot()
# findall returns a list of every element that has a <b> child
myEle = root.findall(".//*[b]")

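myEle now holds a reference to the parent of <b> (as a one-element list), which in this case is <a>.

If you only need the <b> element itself, you can do: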

myEle = root.findall(".//b")

If you want the children of <a>, there are a few different ways:

myEle = root.findall(".//a//*")
myEle = root.findall('.//*[a]//*')[1:]
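The expressions above can be checked against the sample document from the question. The following is a minimal sketch, assuming the sample is well-formed enough to be parsed as XML; the names SAMPLE and tags are only illustrative.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, as a string.
SAMPLE = """<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)


def tags(elements):
    """Return the tag names of a list of elements."""
    return [el.tag for el in elements]


parents_of_b = tags(root.findall(".//*[b]"))             # elements with a <b> child
b_elements = tags(root.findall(".//b"))                  # the <b> elements themselves
children_of_a = tags(root.findall(".//a/*"))             # direct children of <a>
descendants_of_a = tags(root.findall(".//a//*"))         # all descendants of <a>
via_parent_of_a = tags(root.findall(".//*[a]//*")[1:])   # same, skipping <a> itself

print(parents_of_b, b_elements, children_of_a, descendants_of_a, via_parent_of_a)
```

Note that findall always returns a list, even when only one element matches.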

For more information on XPath, see here:

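You can read the content of the whole HTML document as a string. Then you can insert a marker (an HTML anchor element with a unique id) at the known position to get a modified string, and parse that string with xml.etree.ElementTree as if the marker had been part of the original document. After that, you can use XPath to find the parent element of the marker and remove the auxiliary marker again. The result has the same structure as if the original document had been parsed, but now you know which element holds the text.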

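Warning: you have to know whether the position is a byte position or an abstract character position. (Think about multi-byte encodings, or encodings that use sequences of non-fixed length for some characters. Think also about line endings, which may be one or two bytes.)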
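To illustrate the warning, here is a minimal sketch of converting between the kinds of position. The helper names byte_to_char_offset and crlf_to_lf_offset are hypothetical, and the sketch assumes the byte position does not fall in the middle of a multi-byte character (otherwise decode raises UnicodeDecodeError).

```python
def byte_to_char_offset(raw, byte_pos, encoding="utf-8"):
    """Convert an offset counted in bytes to one counted in characters."""
    return len(raw[:byte_pos].decode(encoding))


def crlf_to_lf_offset(text, pos):
    """Convert an offset in text with CRLF line endings to the offset
    in the same text after every CRLF has been replaced by LF."""
    return pos - text.count("\r\n", 0, pos)


# Illustrative data: one two-byte character and one Windows line ending.
raw = "<p>é</p>\r\n<b>x</b>".encode("utf-8")
byte_pos = raw.index(b"x")                     # position counted in bytes

char_pos = byte_to_char_offset(raw, byte_pos)  # position counted in characters
text = raw.decode("utf-8")
lf_pos = crlf_to_lf_offset(text, char_pos)     # position after newline translation

print(byte_pos, char_pos, lf_pos)
```

The last conversion matters because Python's open() in text mode translates "\r\n" to "\n" by default, so a position counted in the file on disk may be shifted by one for every preceding line.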

Try the following example, with the sample from the question stored in doc.html using Windows line endings:

#!python3

import xml.etree.ElementTree as ET

fname = 'doc.html'
pos = 64

with open(fname, encoding='utf-8') as f:
    content = f.read()

# The position_id will be used in XPath, the position_anchor
# uses the variable only for readability. The position anchor
# has the form of an HTML element to be found easily using 
# the XPath expression.
position_id = 'my_unique_position_{}'.format(pos)
position_anchor = '<a id="{}" />'.format(position_id)

# The modified content has one extra anchor as the position marker.
modified_content = content[:pos] + position_anchor + content[pos:]

root = ET.fromstring(modified_content)
ET.dump(root)
print('----------------')

# Now some examples for getting the info around the point.
# '.' = from here; '//' = wherever; 'a[@id=...]' = anchor (a) element
# with the attribute id with the value. 
# We will not use it later -- only for demonstration.
anchor_element = root.find('.//a[@id="{}"]'.format(position_id))
ET.dump(anchor_element)
print('----------------')

# The text at the original position -- the text became the tail 
# of the element.
print(repr(anchor_element.tail))
print('================')

# Now, from scratch, get the nearest parent from the position.
parent = root.find('.//a[@id="{}"]/..'.format(position_id))
ET.dump(parent)
print('----------------')

# ... and the anchor element (again) as the nearest child
# with the attributes.
anchor = parent.find('./a[@id="{}"]'.format(position_id))
ET.dump(anchor)
print('----------------')

# If the marker split the text, part of the text belongs to 
# the parent, part is the tail of the anchor marker.
print(repr(parent.text))
print(repr(anchor.tail))
print('----------------')

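# Modify the parent to remove the anchor element (to get
# the original structure without the marker). Do not forget
# that the text became part of the marker element as its tail.
# Note: this assumes the marker is the first child of its parent;
# otherwise the tail belongs after the preceding sibling's tail.
parent.remove(anchor)
parent.text = (parent.text or '') + (anchor.tail or '')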
ET.dump(parent)
print('----------------')

# The structure of the whole document now does not contain 
# the added anchor marker, and you get the reference
# to the nearest parent.
ET.dump(root)
print('----------------')
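
The script above yields the nearest parent, while the question also asks for the ancestors. xml.etree.ElementTree elements do not keep a reference to their parent, so one possible approach is to build a child-to-parent map first. This is only a sketch; the function name ancestors is hypothetical and the short document is a stand-in for the real one.

```python
import xml.etree.ElementTree as ET


def ancestors(root, element):
    """Return the ancestors of element, nearest first, up to root."""
    # Map every child to its parent by visiting each element once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chain = []
    node = element
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


root = ET.fromstring(
    "<html><body><div><a><b>text</b><i>other</i></a></div></body></html>"
)
b = root.find(".//b")
ancestor_tags = [el.tag for el in ancestors(root, b)]
print(ancestor_tags)
```

Building the map costs one pass over the tree; if many lookups are needed, the map can be built once and reused.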

It prints the following:

c:\_Python\Dejwi\so25370255>a.py
<html>
  <body>
    <div>
      <a>
        <b>
          I know<a id="my_unique_position_64" /> the exact position of this text

        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>
----------------
<a id="my_unique_position_64" /> the exact position of this text

----------------
' the exact position of this text\n        '
================
<b>
          I know<a id="my_unique_position_64" /> the exact position of this text

        </b>

----------------
<a id="my_unique_position_64" /> the exact position of this text

----------------
'\n          I know'
' the exact position of this text\n        '
----------------
<b>
          I know the exact position of this text
        </b>

----------------
<html>
  <body>
    <div>
      <a>
        <b>
          I know the exact position of this text
        </b>
        <i>
          Another text
        </i>
      </a>
    </div>
  </body>
</html>
----------------
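
The marker approach relies on the document being parseable as XML, which real-world HTML often is not. As a hedged alternative, the standard-library html.parser can report the position of each tag, so the chain of open tags at a character offset can be tracked without an XML parser. The class OpenTagsAt and the function open_tags_at are hypothetical names, and the sketch ignores some HTML subtleties such as implied end tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OpenTagsAt(HTMLParser):
    """Record which tags are open at a given character offset."""

    def __init__(self, text, pos):
        super().__init__()
        self.pos = pos
        self.stack = []      # tags that are currently open
        self.result = None   # snapshot of the stack at pos
        # Offset of the first character of every line, to turn the
        # (line, column) pair from getpos() into an absolute offset.
        self.line_starts = [0]
        for i, ch in enumerate(text):
            if ch == "\n":
                self.line_starts.append(i + 1)

    def _check(self):
        # Take the snapshot at the first tag that starts after pos.
        line, col = self.getpos()
        if self.result is None and self.line_starts[line - 1] + col > self.pos:
            self.result = list(self.stack)

    def handle_starttag(self, tag, attrs):
        self._check()
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        self._check()
        if tag in self.stack:
            # Close the tag together with any unclosed tags inside it.
            while self.stack.pop() != tag:
                pass


def open_tags_at(text, pos):
    """Return the tags enclosing offset pos, outermost first."""
    parser = OpenTagsAt(text, pos)
    parser.feed(text)
    parser.close()
    return parser.result if parser.result is not None else list(parser.stack)


SAMPLE = "<html>\n  <body>\n    <div>\n      <a>\n        <b>\n" \
         "          I know the exact position of this text\n" \
         "        </b>\n      </a>\n    </div>\n  </body>\n</html>"
chain = open_tags_at(SAMPLE, SAMPLE.index("the exact"))
print(chain)
```

This returns tag names only, not element objects, so it suits the case where the ancestor chain itself is the answer.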

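How did you get the position of the text? Did you find it by searching for the text?

It is part of the input data.

OK. How long is the document? Kilobytes or megabytes? Is it reasonable to keep it in memory as a string?

It is still reasonable.

You can parse it any way you like, as long as you remember that it is HTML and not XML (unfortunately).

Unfortunately, I can't make any assumptions about the structure of the document.

@Dejwi You can still use findall(".//b") to get the <b> elements: myEles = root.findall(".//b"), and then loop over them with for ele in myEles: if ele.text.strip() != '': # do something

The sentence could also be contained in any other tag.

Are you just trying to find all elements that contain text? for item in root.iter(): if item.text != ''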
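
The loop suggested in the last comment can be sketched as follows. One assumption worth stating: in xml.etree.ElementTree, text is None for an element without leading text, so the check has to cover both None and whitespace-only strings. The function name elements_with_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def elements_with_text(root):
    """Return every element whose own leading text is not blank."""
    return [el for el in root.iter() if el.text and el.text.strip()]


root = ET.fromstring(
    "<html><body><div><a>\n  <b> I know the exact position of this text </b>"
    "<i> Another text </i></a></div></body></html>"
)
text_tags = [el.tag for el in elements_with_text(root)]
print(text_tags)
```

Text that follows a child element is stored in that child's tail attribute rather than in the parent's text, so a complete search would need to inspect tail as well.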