Python: XML parsing for this particular XML

The following ElementTree code prints only stuff.text, which is the text that precedes the nested <head> element:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        print(stuff.text)

If using BeautifulSoup is an option, this becomes trivial:

import bs4
xtxt = '''        <instance id="activate.v.bnc.00024693" docsrc="BNC">
    <answer instance="activate.v.bnc.00024693" senseid="38201"/>
    <context>
    Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
    </context>
    </instance>'''
# Name the parser explicitly; html.parser ships with Python
soup = bs4.BeautifulSoup(xtxt, "html.parser")
print(soup.find('context').text)

If you prefer ElementTree, use itertext() to process all of the text, including the text inside and after nested elements, e.g. ''.join(stuff.itertext()) instead of stuff.text.

If you are confident that your XML file is well-formed, ElementTree is enough: it is part of the Python standard library, so you will have no external dependencies. If the XML may be malformed, however, BeautifulSoup is very good at repairing small errors.
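To make the difference between .text and itertext() concrete, here is a minimal sketch using only the standard library. The shortened SAMPLE string and the helper name context_text are illustrative assumptions, not part of the original data or answer. The sketch also shows that ElementTree raises ParseError on malformed input rather than repairing it.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for one <instance> from the data file.
SAMPLE = (
    '<instance id="demo">'
    '<answer instance="demo" senseid="38201"/>'
    '<context>step on to <head>activate</head> it .</context>'
    '</instance>'
)


def context_text(xml_string):
    """Return the full text of <context>, or None if the XML is malformed."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # ElementTree does not repair broken markup; it raises instead.
        return None
    context = root.find("context")
    # itertext() yields the element's text, its descendants' text and their tails.
    return "".join(context.itertext())


root = ET.fromstring(SAMPLE)
first_chunk = root.find("context").text   # only the text before <head>
full_text = context_text(SAMPLE)          # all text inside <context>
broken = context_text(SAMPLE.replace("</context>", ""))  # mismatched tags

print(repr(first_chunk))
print(repr(full_text))
print(broken)
```

The helper returns None for malformed input only to keep the sketch short; in real code you might fall back to a more lenient parser at that point.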

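Element serialization can also be used. There are two options:

keep the inner <head> markup
return only the text, without any markup

If you serialize with the tags, you can remove the outer <context> tags manually:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # all text, including text inside and after nested elements
        print(''.join(stuff.itertext()))
        # convert the element to a string and remove the outer <context> tags
        # (removeprefix/removesuffix need Python 3.9+)
        xml_str = et.tostring(stuff, encoding="unicode").strip()
        print(xml_str.removeprefix("<context>").removesuffix("</context>"))
        # text only, without any markup
        print(et.tostring(stuff, encoding="unicode", method='text'))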
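The two serialization options can be compared on a small inline element. This is a sketch under stated assumptions: the shortened <context> string is made up for illustration, and the slicing step assumes the outer tag carries no attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened <context> element for illustration.
context = ET.fromstring(
    "<context>step on to <head>activate</head> it .</context>"
)

# Option 1: serialize with markup, then drop the outer <context> wrapper.
# encoding="unicode" makes tostring() return str instead of bytes.
with_tags = ET.tostring(context, encoding="unicode")
inner = with_tags[len("<context>"):-len("</context>")]

# Option 2: serialize the text only, discarding all markup.
text_only = ET.tostring(context, encoding="unicode", method="text")

print(inner)
print(text_only)
```

Option 1 is useful when the position of <head> matters; option 2 gives the same string as joining itertext().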
