Java: extracting data from XML text


I have a lot of XML data that looks like this:

<contextfile concordance=brown>
<context filename=br-a02 paras=yes>
<p pnum=1>
<s snum=1>
<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>
<wf cmd=done pos=NN lemma=approval wnsn=1 lexsn=1:04:02::>approval</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Gov._Price_Daniel</wf>
<wf cmd=ignore pos=POS>'s</wf>
<punc>``</punc>
<wf cmd=done pos=JJ lemma=abandoned wnsn=1 lexsn=5:00:00:uninhabited:00>abandoned</wf>
<wf cmd=done pos=NN lemma=property wnsn=1 lexsn=1:21:00::>property</wf>
<punc>''</punc>
<wf cmd=done pos=NN lemma=act wnsn=1 lexsn=1:10:01::>act</wf>
<wf cmd=done pos=VB lemma=seem wnsn=1 lexsn=2:39:00::>seemed</wf>
<wf cmd=done pos=JJ lemma=certain wnsn=4 lexsn=3:00:03::>certain</wf>
<wf cmd=done pos=NN lemma=thursday wnsn=1 lexsn=1:28:00::>Thursday</wf>
<wf cmd=ignore pos=IN>despite</wf>
<wf cmd=ignore pos=DT>the</wf>
<wf cmd=done pos=JJ lemma=adamant wnsn=1 lexsn=5:00:00:inflexible:02>adamant</wf>
<wf cmd=done pos=NN lemma=protest wnsn=1 lexsn=1:10:00::>protests</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NN lemma=texas wnsn=1 lexsn=1:15:00::>Texas</wf>
<wf cmd=done pos=NN lemma=banker wnsn=1 lexsn=1:18:00::>bankers</wf>
<punc>.</punc>
</s>
</p>
The subsequent data in the file, however, begins like this:

<p pnum=2>
<s snum=2>

........
</s>
</p>

........

This seems to work:

//s[@snum]/string-join(wf | punc, " ")

I verified it on (XPath online tester) by pasting the content of the sample "p" element twice. You can drag the horizontal bar down a little to see the whole result.
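Note that the expression above calls `string-join` inside a path step, which needs an XPath 2.0 (or later) processor; the `javax.xml.xpath` API bundled with the JDK implements XPath 1.0 only. The following is a minimal sketch of an equivalent that uses only JDK classes: it selects the `s` elements with XPath 1.0 and joins the text of their `wf` and `punc` children in Java. The class and method names are hypothetical, and the sketch assumes the input is well-formed XML with quoted attribute values, which the raw sample above is not.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes well-formed XML (quoted attribute values).
class SentenceExtractor {

    // Returns one space-joined string per <s snum=...> element.
    static List<String> extractSentences(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // XPath 1.0 part: select every <s> element that has an snum attribute.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList sentences = (NodeList) xpath.evaluate(
                    "//s[@snum]", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < sentences.getLength(); i++) {
                List<String> tokens = new ArrayList<>();
                // Walk the children in document order, keeping only <wf> and <punc>.
                for (Node child = sentences.item(i).getFirstChild();
                        child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() != Node.ELEMENT_NODE) {
                        continue; // skip whitespace text nodes between elements
                    }
                    String name = child.getNodeName();
                    if (name.equals("wf") || name.equals("punc")) {
                        tokens.add(child.getTextContent());
                    }
                }
                result.add(String.join(" ", tokens));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p pnum=\"1\"><s snum=\"1\">"
                + "<wf cmd=\"done\" pos=\"NN\">Committee</wf>"
                + "<wf cmd=\"done\" pos=\"NN\">approval</wf>"
                + "<punc>.</punc></s></p>";
        for (String sentence : extractSentences(xml)) {
            System.out.println(sentence);
        }
    }
}
```

Joining in Java rather than in the expression keeps the sketch free of third-party libraries; an XPath 2.0 capable processor could evaluate the original expression directly.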

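Maybe this answer helps:

Possible duplicate

@TimothyTruckle The question may be similar, but it does not show how to extract the words. I tried to extract them using some sample XML code I found online, but I get an error: Open quote is expected for attribute "concordance" associated with an element type "contextfile".

Your input is not well-formed XML. Talk to whoever produces it so that it conforms to the XML specification.

@TimothyTruckle This data comes from the Brown natural language processing corpus; it was not generated in-house.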
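If the corpus files cannot be regenerated, one possible workaround is to quote the attribute values in a pre-processing step before handing the text to an XML parser. What follows is a rough sketch with hypothetical class and method names. It assumes that no attribute value is already quoted, that values contain no whitespace, quotes, `<` or `>`, and that there are no self-closing tags; this holds for the sample shown in the question but has not been checked against the whole corpus. It also does not escape characters such as `&`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processor: turns name=value into name="value" inside start tags.
class AttributeQuoter {

    // A start tag: '<', a letter, then anything up to the next '>'.
    // End tags such as </wf> do not match because of the leading '/'.
    private static final Pattern START_TAG = Pattern.compile("<[A-Za-z][^<>]*>");

    // An unquoted attribute: a name, '=', then a run of non-space, non-quote characters.
    private static final Pattern BARE_ATTR =
            Pattern.compile("([A-Za-z_][\\w.-]*)=([^\\s\"'<>]+)");

    static String quoteAttributes(String text) {
        Matcher tags = START_TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (tags.find()) {
            // Rewrite attributes only inside the tag, never in element text.
            String fixed = BARE_ATTR.matcher(tags.group()).replaceAll("$1=\"$2\"");
            tags.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        tags.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>";
        System.out.println(quoteAttributes(line));
    }
}
```

Restricting the rewrite to start tags means an `=` that appears in element text is left alone. The output can then be passed to a standard parser such as `DocumentBuilder`.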