Java: how to get specific information from an XML file
I have a very large XML file; here is an excerpt:
...
<LexicalEntry id="Ait~ifAq_1">
<Lemma partOfSpeech="n" writtenForm="اِتِّفاق"/>
<Sense id="Ait~ifAq_1_tawaAfuq_n1AR" synset="tawaAfuq_n1AR"/>
<WordForm formType="root" writtenForm="وفق"/>
</LexicalEntry>
<LexicalEntry id="tawaA&um__1">
<Lemma partOfSpeech="n" writtenForm="تَوَاؤُم"/>
<Sense id="tawaA&um__1_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
<WordForm formType="root" writtenForm="وأم"/>
</LexicalEntry>
<LexicalEntry id="tanaAgum_2">
<Lemma partOfSpeech="n" writtenForm="تناغُم"/>
<Sense id="tanaAgum_2_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
<WordForm formType="root" writtenForm="نغم"/>
</LexicalEntry>
<Synset baseConcept="3" id="tawaAfuq_n1AR">
<SynsetRelations>
<SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
<SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
<SynsetRelation relType="hypernym" targets="ext_noun_NP_420"/>
</SynsetRelations>
<MonolingualExternalRefs>
<MonolingualExternalRef externalReference="13971065-n" externalSystem="PWN30"/>
</MonolingualExternalRefs>
</Synset>
...
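One solution, given the memory consumption, is to use a stream reader, but I don't know how to get what I want with it. Please help.

If this XML file is too large to represent in memory, use SAX.
You will need to write your SAX handler so that it maintains a location. For this I typically use a StringBuffer, but a stack of strings works just as well. This part is important because it lets you keep track of the path back to the document root, so you know where in the document you are at any given point (useful when you only want to extract a small amount of information).
The main logic flow looks like this: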
1. When entering a node, add the node's name to the stack.
2. When exiting a node, pop the node's name (top element) off the stack.
3. To know your location, read your current branch of the XML from the bottom of the stack to the top of the stack.
4. When entering a region you care about, clear the buffer you will capture the characters into.
5. When exiting a region you care about, flush the buffer into the data structure you will return back as your output.
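The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's own code: the class names `SynsetEntryHandler` and `SaxPathDemo` and the `<Lexicon>` root element are assumptions, since the excerpt does not show the real root. Because the data in the excerpt lives in attributes rather than in element text, steps 4 and 5 (the character buffer) are not needed here; only the path stack from steps 1 to 3 is shown.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the id of every LexicalEntry whose Sense points at a given synset.
class SynsetEntryHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>(); // open elements, innermost on top
    private final String wantedSynset;
    private String currentEntryId;
    final List<String> entryIds = new ArrayList<>();

    SynsetEntryHandler(String wantedSynset) {
        this.wantedSynset = wantedSynset;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.push(qName); // step 1: entering a node
        String where = currentPath();
        if (where.endsWith("/LexicalEntry")) {
            currentEntryId = atts.getValue("id");
        } else if (where.endsWith("/LexicalEntry/Sense")
                && wantedSynset.equals(atts.getValue("synset"))) {
            entryIds.add(currentEntryId);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.pop(); // step 2: leaving a node
    }

    // Step 3: read the stack from the bottom (root) to the top (current element).
    private String currentPath() {
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = path.descendingIterator(); it.hasNext();) {
            sb.append('/').append(it.next());
        }
        return sb.toString();
    }
}

class SaxPathDemo {
    static List<String> entriesForSynset(String xml, String synset) throws Exception {
        SynsetEntryHandler handler = new SynsetEntryHandler(synset);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.entryIds;
    }

    public static void main(String[] args) throws Exception {
        // <Lexicon> is a made-up root element; the excerpt does not show the real one.
        String xml = "<Lexicon>"
                + "<LexicalEntry id='Ait~ifAq_1'><Sense synset='tawaAfuq_n1AR'/></LexicalEntry>"
                + "<LexicalEntry id='tanaAgum_2'><Sense synset='AinosijaAm_n1AR'/></LexicalEntry>"
                + "</Lexicon>";
        System.out.println(entriesForSynset(xml, "AinosijaAm_n1AR")); // [tanaAgum_2]
    }
}
```

For a real file, the same handler can be passed to `SAXParser.parse(File, DefaultHandler)` instead of an in-memory stream.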
In this way you can effectively skip all the branches of the XML tree that you don't care about.
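A SAX parser is different from a DOM parser: it only looks at the current item and cannot see future items until they become the current item. It is one of the options when the XML file is very large. There are many alternatives; to name a few: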
- SAX parser
- DOM parser
- JDOM parser
- DOM4J parser
- StAX parser
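Tutorials are available for all of these.
In my opinion, once you have learned the basics, go straight to DOM4J or JDOM for commercial products.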
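Since the question mentions a stream reader and StAX appears in the list above, here is a minimal StAX sketch for comparison. It is an illustration under assumptions, not part of the original answer: the class name `StaxDemo`, the method `relationTargets`, and the `<Lexicon>` root element are made up. It pulls events one at a time and keeps only the relations of one Synset, so the whole document is never held in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxDemo {
    // Returns "relType targets" for every SynsetRelation inside the Synset with the given id.
    static List<String> relationTargets(Reader in, String synsetId) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> relations = new ArrayList<>();
        boolean insideWantedSynset = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("Synset")) {
                    insideWantedSynset = synsetId.equals(reader.getAttributeValue(null, "id"));
                } else if (insideWantedSynset && name.equals("SynsetRelation")) {
                    relations.add(reader.getAttributeValue(null, "relType")
                            + " " + reader.getAttributeValue(null, "targets"));
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("Synset")) {
                insideWantedSynset = false; // left the Synset element
            }
        }
        reader.close();
        return relations;
    }

    public static void main(String[] args) throws Exception {
        // <Lexicon> is a made-up root element; the excerpt does not show the real one.
        String xml = "<Lexicon><Synset id='tawaAfuq_n1AR'><SynsetRelations>"
                + "<SynsetRelation relType='hyponym' targets='AinosijaAm_n1AR'/>"
                + "<SynsetRelation relType='hypernym' targets='ext_noun_NP_420'/>"
                + "</SynsetRelations></Synset></Lexicon>";
        System.out.println(relationTargets(new StringReader(xml), "tawaAfuq_n1AR"));
    }
}
```

For a file on disk, a `java.io.FileReader` or an `InputStream` can be handed to `createXMLStreamReader` in place of the `StringReader`.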
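The logic of a SAX parser is that you have a handler class, MyHandler (named UserHandler in the code below), which extends DefaultHandler and overrides (@Override) some of its methods:
XML file: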
<?xml version="1.0"?>
<class>
<student rollno="393">
<firstname>dinkar</firstname>
<lastname>kad</lastname>
<nickname>dinkar</nickname>
<marks>85</marks>
</student>
<student rollno="493">
<firstname>Vaneet</firstname>
<lastname>Gupta</lastname>
<nickname>vinni</nickname>
<marks>95</marks>
</student>
<student rollno="593">
<firstname>jasvir</firstname>
<lastname>singn</lastname>
<nickname>jazz</nickname>
<marks>90</marks>
</student>
</class>
Main class:
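import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SAXParserDemo {
    public static void main(String[] args) {
        try {
            // The XML shown above, saved as input.txt
            File inputFile = new File("input.txt");
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            UserHandler userhandler = new UserHandler();
            saxParser.parse(inputFile, userhandler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

// The post does not show UserHandler; this minimal stand-in is an assumption.
// It prints the rollno attribute of each student element.
class UserHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {
        if ("student".equalsIgnoreCase(qName)) {
            System.out.println("Roll No: " + attributes.getValue("rollno"));
        }
    }
}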
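XPath was designed for exactly this. Java provides support for it in the javax.xml.xpath package.
To do what you want, the code would look something like this: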
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Note: every evaluate() call below is given a fresh InputSource, so the file is parsed again each time.
List<String> findRelations(String word,
                           Path xmlFile)
throws XPathException {
    String xmlLocation = xmlFile.toUri().toASCIIString();
    XPath xpath = XPathFactory.newInstance().newXPath();

    // Step 1: find the synset of the requested word.
    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("word") ? word : null));
    String id = xpath.evaluate(
        "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
        new InputSource(xmlLocation));

    // Step 2: find the SynsetRelation elements of that synset.
    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("id") ? id : null));
    NodeList matches = (NodeList) xpath.evaluate(
        "//Synset[@id=$id]/SynsetRelations/SynsetRelation",
        new InputSource(xmlLocation),
        XPathConstants.NODESET);

    List<String> relations = new ArrayList<>();
    int matchCount = matches.getLength();
    for (int i = 0; i < matchCount; i++) {
        Element match = (Element) matches.item(i);
        String relType = match.getAttribute("relType");
        String synset = match.getAttribute("targets");

        // Step 3: find the written forms belonging to the relation's target synset.
        xpath.setXPathVariableResolver(
            name -> (name.getLocalPart().equals("synset") ? synset : null));
        NodeList formNodes = (NodeList) xpath.evaluate(
            "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
            new InputSource(xmlLocation),
            XPathConstants.NODESET);

        int formCount = formNodes.getLength();
        StringJoiner forms = new StringJoiner(",");
        for (int j = 0; j < formCount; j++) {
            forms.add(
                formNodes.item(j).getNodeValue());
        }
        relations.add(
            String.format("%s %s %s", word, relType, forms));
    }
    return relations;
}
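//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset
matches any LexicalEntry element anywhere in the XML document that has either:
- a WordForm child with a writtenForm attribute whose value equals the word variable
- a Lemma child with a writtenForm attribute whose value equals the word variable
For each such LexicalEntry element, it returns the value of the synset attribute of any Sense element that is a direct child of that LexicalEntry element.
The word variable is defined externally, by xpath.setXPathVariableResolver, before the XPath expression is evaluated.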
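//Synset[@id=$id]/SynsetRelations/SynsetRelation
matches any Synset element in the XML document whose id attribute equals the id variable. For each such Synset element, it looks for any direct SynsetRelations child element and returns each of that element's direct SynsetRelation children.
The id variable is defined externally, by xpath.setXPathVariableResolver, before the XPath expression is evaluated.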
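//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm
matches any LexicalEntry element in the XML document that has a Sense child element whose synset attribute has the same value as the synset variable. For each matching element, it finds any WordForm child element and returns that element's writtenForm attribute.
The synset variable is defined externally, by xpath.setXPathVariableResolver, before the XPath expression is evaluated.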
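Logically, what the above amounts to is:
- Locate the synset value for the requested word.
- Use the synset value to locate SynsetRelation elements.
- Locate the writtenForm values corresponding to the targets value of each matched SynsetRelation.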
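The three steps can be tried out on a small in-memory document. The sketch below is a variation on the same idea, not the answerer's code: it parses the XML once into a DOM `Document` and evaluates all three expressions against it, which assumes the document fits in memory. The class name `XPathStepsDemo`, the `<Lexicon>` root element, and the ASCII stand-in words in the sample data are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathStepsDemo {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    static List<String> findRelations(Document doc, String word) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // One resolver backed by a map, instead of replacing the resolver before each query.
        Map<String, Object> vars = new HashMap<>();
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));

        // Step 1: synset of the requested word.
        vars.put("word", word);
        String id = xpath.evaluate(
                "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
                doc);

        // Step 2: relations of that synset.
        vars.put("id", id);
        NodeList rels = (NodeList) xpath.evaluate(
                "//Synset[@id=$id]/SynsetRelations/SynsetRelation", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            // Step 3: written forms of the entries in the target synset.
            vars.put("synset", rel.getAttribute("targets"));
            NodeList forms = (NodeList) xpath.evaluate(
                    "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
                    doc, XPathConstants.NODESET);
            StringJoiner joined = new StringJoiner(",");
            for (int j = 0; j < forms.getLength(); j++) {
                joined.add(forms.item(j).getNodeValue());
            }
            result.add(word + " " + rel.getAttribute("relType") + " " + joined);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data with ASCII stand-ins for the Arabic word forms.
        String xml = "<Lexicon>"
                + "<LexicalEntry id='e1'><Lemma writtenForm='agreement'/>"
                + "<Sense synset='s1'/><WordForm writtenForm='root1'/></LexicalEntry>"
                + "<LexicalEntry id='e2'><Lemma writtenForm='harmony'/>"
                + "<Sense synset='s2'/><WordForm writtenForm='root2'/></LexicalEntry>"
                + "<Synset id='s1'><SynsetRelations>"
                + "<SynsetRelation relType='hyponym' targets='s2'/>"
                + "</SynsetRelations></Synset></Lexicon>";
        System.out.println(findRelations(parse(xml), "agreement")); // [agreement hyponym root2]
    }
}
```

Parsing once trades memory for speed; for a file too large for a DOM, a streaming approach such as SAX is the safer choice.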