Merging words in XML with Python

In the following XML:


    <w:p w:rsidR="00912B30" w:rsidRPr="00912B30" w:rsidRDefault="00912B30" w:rsidP="00912B30">
            <w:autoSpaceDE w:val="0"/>
            <w:autoSpaceDN w:val="0"/>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
        <w:r w:rsidRPr="00912B30">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t xml:space="preserve">Considering those situations, after 1970 The </w:t>
        <w:r w:rsidRPr="00E155EC">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t>Agricultural Land Law</w:t>
        <w:r w:rsidRPr="00912B30">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t xml:space="preserve"> of 1952 was modified and changed the principle to permit renting and lending agricultural land. The way of thinking was as follows. If it was difficult to widen farmers’ size by buying agricultural land, expanding the size by renting would be possible. After that some positive framework to promote renting and lending agricultural land. For example, The </w:t>
        <w:r w:rsidRPr="00E155EC">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t>Agricultural Land Use Promotion Project</w:t>
        <w:r w:rsidRPr="00912B30">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t xml:space="preserve"> had started in 1975 and The </w:t>
        <w:r w:rsidRPr="00E155EC">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t>Agricultural Land Use Promotion Law</w:t>
        <w:r w:rsidRPr="00912B30">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>
            <w:t xml:space="preserve"> was established in 1980. Actually after that, area of agricultural land by transfer of ownership of owned agricultural land with compensation had been more than the area by transfer of rights for </w:t>
        <w:r w:rsidRPr="00912B30">
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
                <w:snapToGrid w:val="0"/>
                <w:kern w:val="0"/>
                <w:szCs w:val="21"/>

    text = ""  # initialize empty string where all words will be stored
    source = etree.parse(doc_xml)
    for p in source.findall('.//'+w1+'p'):  # iterate over every p tag
        text += " "  # add a space to separate words in successive paragraphs
        for b in p.findall('.//{%(ns)s}strike/../..//{%(ns)s}t' % {'ns': w}):
            text += ''.join(b.text)  # append the text of each struck-through run
This produces:

    text = " Agricultural Land LawAgricultural Land Use Promotion ProjectAgricultural Land Use Promotion Law"

but the expected result is:

    text = " Agricultural Land Law Agricultural Land Use Promotion Project Agricultural Land Use Promotion Law"
Original fix: replace the last line of the code with:

    text += " " + ''.join(b.text)
“he lp”


<w:r w:rsidRPr:00C42D65>
<w:r w:rsidRPr:00C42D65>




text = ''
for t in source.xpath('.//w:p//w:r//w:t', namespaces={'w': w}):
    if t.xpath('..//w:strike', namespaces={'w': w}):
        if text:  # To prevent a space before the first text.
            text += ' '
        text += t.text

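In the inner loop (except for the first element).

Is there a "not equal" function to match elements where a certain attribute is absent? Like the one here:

@Swordy, yes, there is: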
//element[not(@attribute you not want)]
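
To make the `not(@attribute)` predicate concrete, here is a minimal sketch with lxml. The two-run sample and its rsid value are made up for illustration; the namespace URI is the standard WordprocessingML one. Note that the attribute needs its namespace prefix inside the predicate.

```python
from lxml import etree

# Standard WordprocessingML namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: one run with w:rsidRPr, one without.
xml = (
    '<w:p xmlns:w="%s">'
    '<w:r w:rsidRPr="00E155EC"><w:t>marked</w:t></w:r>'
    '<w:r><w:t>unmarked</w:t></w:r>'
    '</w:p>' % W
)
root = etree.fromstring(xml)
ns = {"w": W}

# Text of runs that do NOT carry a w:rsidRPr attribute.
without_attr = root.xpath(".//w:r[not(@w:rsidRPr)]/w:t/text()", namespaces=ns)
# Text of runs that do carry it, for comparison.
with_attr = root.xpath(".//w:r[@w:rsidRPr]/w:t/text()", namespaces=ns)

print(without_attr, with_attr)
```

The same predicate can also test a value, e.g. `[not(@w:rsidRPr="00E155EC")]`, which matches runs where the attribute is missing or has a different value.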

from lxml import etree

source = etree.parse(doc_xml)
text = ' '.join(
    source.xpath('.//w:p//w:strike/../..//w:t/text()', namespaces={'w': w})
)
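
For anyone who wants to try the space-separated result without a .docx at hand, here is a small self-contained sketch. It makes several assumptions: the struck-through runs contain `<w:strike/>` inside `<w:rPr>` (inferred from the path used in the question, since the XML shown above lacks those tags), the sample paragraph is a cut-down hypothetical, and `struck_text` is a made-up helper name. It uses the standard library's `xml.etree.ElementTree`, whose `findall` accepts the same path as the question's code.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: two struck-through runs around a plain one.
SAMPLE = """
<w:body xmlns:w="{ns}">
  <w:p>
    <w:r><w:rPr><w:strike/></w:rPr><w:t>Agricultural Land Law</w:t></w:r>
    <w:r><w:t xml:space="preserve"> of 1952 was modified. The </w:t></w:r>
    <w:r><w:rPr><w:strike/></w:rPr><w:t>Agricultural Land Use Promotion Law</w:t></w:r>
  </w:p>
</w:body>
""".format(ns=W)


def struck_text(root):
    """Return the text of all struck-through runs, separated by single spaces."""
    # strike -> rPr -> r, then every w:t below that run (same path as the question).
    path = ".//{%(ns)s}strike/../..//{%(ns)s}t" % {"ns": W}
    pieces = []
    for p in root.iter("{%s}p" % W):
        # Collect the pieces first; joining afterwards avoids stray
        # leading or trailing spaces.
        pieces.extend(t.text for t in p.findall(path) if t.text)
    return " ".join(pieces)


root = ET.fromstring(SAMPLE)
print(struck_text(root))
```

Collecting the pieces in a list and joining once at the end puts the separator only between items, so no check for the first element is needed.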