How do I compare XML tag values and merge them when they are the same? (Python)

python xml tags


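I have an XML file structured like this:

C
A
P
I
T
O
L
O
I
I
I

The actual file is much longer. I want to compare the sizes of the words and merge consecutive words of the same size, keeping the tags, like this:

C
API
TOLO III

So far I can compare the attributes, but I don't know how to keep the tags. This is the code I have so far: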

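import xml.etree.ElementTree as ET

# `xml` holds the document shown above as a string
words = []
root = ET.fromstring(xml)
pages = root.findall('.//page')
for page in pages:
    previous_key = None
    current_key = None
    text = page.findall('.//text')
    for txt in text:
        # elements without font/size inherit the key of the previous element
        if previous_key:
            current_key = (txt.attrib.get('font', previous_key[0]), txt.attrib.get('size', previous_key[1]))
        else:
            current_key = (txt.attrib.get('font', 'empty'), txt.attrib.get('size', 'empty'))
        # a change of key starts a new group
        if current_key != previous_key:
            words.append([])
        words[-1].append(txt.text)
        previous_key = current_key

for group in words:
    if group:
        print(''.join(group))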

What am I missing?

This should work (it's not the cleanest thing I've ever written, but it does work):

xml = '''
C
A
P
I
T
O
L
O
I
I
I
'''
import xml.etree.ElementTree as ET
new_txt = ""
root = ET.fromstring(xml)
def doit(tag, attrib_list, text="", last_size=0):
    if "size" in attrib_list.keys():
        if attrib_list['size'] != last_size:
            if last_size != 0:
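                s = f"\n

OK, this took me a while to figure out... I'm not even sure it's still relevant, but since solving it had become a matter of principle for me, I'm posting it here for future reference.

The basic ask of the question is to group elements that meet a certain requirement and are consecutive. The idea is to select the first element that meets the condition together with all its following siblings, then select the last element that meets the condition together with all its preceding siblings. The elements common to both selections (their intersection) are your target. To do that, at least in XPath, you need the intersect() function.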
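The selection idea can be mimicked in plain Python to see why the intersection isolates exactly one run. This is only an illustrative sketch using the standard library's `xml.etree.ElementTree` instead of XPath; the helper name `run_between` and the sample document (with an extra `X` element to create a second run of the same size) are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two separate runs of size 12.333 (A, P, I and X).
SAMPLE = """<line>
<text size="12.482">C</text>
<text size="12.333">A</text>
<text size="12.333">P</text>
<text size="12.333">I</text>
<text size="12.482">T</text>
<text size="12.333">X</text>
</line>"""


def run_between(siblings, first, last, size):
    """Intersect 'first and what follows' with 'last and what precedes'."""
    start = siblings.index(first)
    end = siblings.index(last)
    # first element plus every following sibling of the given size
    down = [el for el in siblings[start:] if el.get("size") == size]
    # last element plus every preceding sibling of the given size
    up = [el for el in siblings[:end + 1] if el.get("size") == size]
    # keep document order and compare by identity, like a node-set intersection
    return [el for el in down if any(el is other for other in up)]


texts = list(ET.fromstring(SAMPLE))
run = run_between(texts, texts[1], texts[3], "12.333")
print("".join(el.text for el in run))
```

Going down from `A` alone would also pick up `X`; going up from `I` alone would not, so only the elements present in both selections form the run.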
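The problem is that intersect() is an XPath 2.0 function, while the main Python library for XPath (lxml) only supports XPath 1.0. It can be emulated in XPath 1.0, but the result is so convoluted that it makes your head spin, especially in this case.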
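So we need a Python library that supports XPath 2.0. There is one: elementpath. I tried to do this with elementpath but ran into another problem. Applying intersect() correctly in this case requires the XPath count() function. As it turned out, however, there was a bug in elementpath's implementation of count(). The issue was reported and has been fixed in the latest version of elementpath.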
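To make the role of count() concrete: in the answer's code it is used to find how many runs start for a given size. A rough standard-library equivalent, offered only as an illustration (the function name `count_run_starts` and the sample are hypothetical), counts the elements whose immediately preceding sibling has a different size or no size at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the simplified XML in the answer.
SAMPLE = """<line>
<text size="12.482">C</text>
<text size="12.333">A</text>
<text size="12.333">P</text>
<text size="12.482">T</text>
<text size="12.482">O</text>
<text></text>
<text size="12.482">I</text>
</line>"""


def count_run_starts(siblings, size):
    """Count elements of `size` whose previous sibling does not share it."""
    starts = 0
    previous_size = None  # nothing precedes the first element
    for el in siblings:
        current = el.get("size")
        if current == size and previous_size != size:
            starts += 1
        previous_size = current
    return starts


texts = list(ET.fromstring(SAMPLE))
print(count_run_starts(texts, "12.482"))
print(count_run_starts(texts, "12.333"))
```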
With all that out of the way, we can now tackle the actual problem:
#I used a simplified version of the xml to streamline things
sizes = """<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <box>
            <line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </line>
        </box>
    </page>
</pages>
"""

import elementpath
from lxml import etree
content = sizes.encode('utf-8')
root = etree.XML(content)

godels = elementpath.select(root, '//text[not(./@size = preceding::text/@size)]/@size') #find out how many different  'size' attribute values there are

for godel in godels:
    gc_expres = f'count(//text[@size="{godel}"][not(preceding-sibling::text[1][@size="{godel}"])])' # for each size, create an expression to determine the number of starting positions
    g_cnt = elementpath.select(root,gc_expres) #activate the function
    for i in range(g_cnt):
        loc = i+1 # the range() method, being pythonian, counts from 0; xpath counts from 1
        top = f'//text[@size="{godel}"][(not(preceding-sibling::text[1][@size="{godel}"]) or count(preceding-sibling::text)=0)][{loc}]/(., following-sibling::text[@size="{godel}"])' #the expression for starting at the top and going down
        bot = f'(//text[@size="{godel}"][following-sibling::text[1][not(@size="{godel}")]])[{loc}]/(.,preceding-sibling::text[@size="{godel}"])' #the expression for starting at the bottom and going up
        int_expr = f'{top} intersect {bot}' #the intersect expression
        combo = elementpath.select(root, int_expr) #the intersect function in action!
        newt = ''.join(str(member.text) for member in combo) #now that we have the group, create a string of their combined text values
        combo[0].text = newt #replace the text of the first group member with the new combined string
        for extra in combo[1:]: #the slice skips over the first group member
            combo[0].getparent().remove(extra) #remove all other members of the group
print(etree.tostring(root).decode())


Output:
<pages>
    <page>
        <box>
            <line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </line>
        </box>
    </page>
</pages>
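For comparison, and only as a sketch: when the runs sit next to each other under one parent, as in the simplified XML above, a similar merge can be done with the standard library alone by grouping neighbouring children with `itertools.groupby`. This is not the XPath approach used in the answer; the function name `merge_runs` is hypothetical, and the sketch assumes that elements without a `size` attribute should be left untouched.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

SAMPLE = """<pages><page><box><line>
<text size="12.482">C</text>
<text size="12.333">A</text>
<text size="12.333">P</text>
<text size="12.333">I</text>
<text size="12.482">T</text>
<text size="12.482">O</text>
<text size="12.482">L</text>
<text size="12.482">O</text>
<text></text>
<text size="12.482">I</text>
<text size="12.482">I</text>
<text size="12.482">I</text>
<text></text>
</line></box></page></pages>"""


def merge_runs(parent, tag="text", attr="size"):
    """Merge consecutive <tag> children of `parent` that share `attr`."""
    children = list(parent)  # copy, because the parent is modified below
    key = lambda el: (el.tag, el.get(attr))
    for (el_tag, size), run in groupby(children, key=key):
        run = list(run)
        if el_tag != tag or size is None or len(run) < 2:
            continue  # nothing to merge for this run
        # the first element of the run keeps its tag and attributes
        run[0].text = "".join(el.text or "" for el in run)
        for extra in run[1:]:
            parent.remove(extra)


root = ET.fromstring(SAMPLE)
for line in list(root.iter("line")):
    merge_runs(line)
print(ET.tostring(root, encoding="unicode"))
```

Because the first element of each run is kept and only its text is rewritten, the surviving `<text>` tags retain their original attributes.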

