How to access data from an MMAX2-annotated XML corpus in Python

python xml python-3.x pandas nlp


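I have an annotated corpus for a coreference resolution task. Could you tell me how to extract the data from the XML file? I tried the following, but it did not work: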

from lxml import objectify
import pandas as pd

xml = objectify.parse(open('Dari_Coref_2_coref_level.xml'))
root = xml.getroot()

df = pd.DataFrame(columns='markable')

for i in range(0, 2):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['markable'], [obj[0].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

print(df)

The structure of my XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/coref">
<markable id="markable_1" span="word_1..word_4" mentiontype="ne"  
coref_class="set_1"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_3" span="word_33..word_34" mentiontype="ne"  
coref_class="set_2"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_2" span="word_5..word_9" mentiontype="np"  
coref_class="set_1"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_5" span="word_89..word_90" mentiontype="np"  
coref_class="set_3"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_4" span="word_35..word_44" mentiontype="np"  
coref_class="set_2"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_7" span="word_124..word_126" mentiontype="ne"  
coref_class="set_4"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_6" span="word_91..word_95" mentiontype="np"  
coref_class="set_3"  mmax_level="coref"  coreftype="ident" />
</markables>

Try this:

import lxml.html

with open('Dari_Coref_2_coref_level.xml', 'rb') as file:
    xml = file.read()

tree = lxml.html.fromstring(xml)

# Use XPath to extract the data you want.
# For example, to extract the id attribute of every markable tag:
ids = tree.xpath("//markable/@id")
print(ids) # ['markable_1', 'markable_3', 'markable_2', ...]

For other queries, see an XPath syntax tutorial.
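Note that `lxml.html` is an HTML parser, so the XPath above matches `markable` only because the namespace declared on `<markables>` is ignored. As an alternative (not part of the original answer), here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` with the namespace made explicit, and that also groups markables by `coref_class`. The helper names `read_markables` and `group_by_chain` are made up for illustration, and the sample data is just the first three markables from the question, inlined so the sketch runs without the file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace URI copied from the xmlns attribute of the <markables> root.
# The prefix "c" is arbitrary; it only has to match the one used in findall().
NS = {"c": "www.eml.org/NameSpaces/coref"}

# First three markables from the question, inlined for a self-contained run.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/coref">
<markable id="markable_1" span="word_1..word_4" mentiontype="ne"
coref_class="set_1"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_3" span="word_33..word_34" mentiontype="ne"
coref_class="set_2"  mmax_level="coref"  coreftype="ident" />
<markable id="markable_2" span="word_5..word_9" mentiontype="np"
coref_class="set_1"  mmax_level="coref"  coreftype="ident" />
</markables>
"""


def read_markables(xml_bytes):
    """Return one dict of attributes per <markable> element (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    # The data lives in attributes, not in child elements or text.
    return [dict(m.attrib) for m in root.findall("c:markable", NS)]


def group_by_chain(rows):
    """Map each coref_class value to the ids of its markables (hypothetical helper)."""
    chains = defaultdict(list)
    for row in rows:
        chains[row["coref_class"]].append(row["id"])
    return dict(chains)


rows = read_markables(SAMPLE)
chains = group_by_chain(rows)
print(rows[0]["span"])
print(chains)
```

To run this on the real corpus, pass the file contents instead of `SAMPLE`, for example `read_markables(open('Dari_Coref_2_coref_level.xml', 'rb').read())`. If a table is wanted, `pd.DataFrame(rows)` should give one column per attribute.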

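Welcome to Stack Overflow. This site is a repository of useful questions and their answers, not a help forum. At a minimum, you need to explain in detail what "did not work" means, show what you want to extract, and explain what you mean by an annotated XML corpus. Please take the tour, visit the help center, and in particular read How to Ask to learn how to use this site effectively.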