Java Apache Tika: How to use XPath queries
I am using Apache Tika to parse an XML file. I want to extract certain tags and their content from the XML and store them in a HashMap. Right now I can extract the full text content of the XML, but the tags are lost.
//detecting the file type
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try {
    inputstream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
} catch (URISyntaxException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
ParseContext pcontext = new ParseContext();

//Xml parser
XMLParser xmlparser = new XMLParser();
xmlparser.parse(inputstream, handler, metadata, pcontext);
System.out.println("Contents of the document:" + handler.toString());
System.out.println("Metadata of the document:");
String[] metadataNames = metadata.names();
for (String name : metadataNames) {
    System.out.println(name + ": " + metadata.get(name));
}
This prints the entire text content of the XML.
Now I want to extract only certain parts of the XML. Since Tika supports XPath queries, I tried this:
    XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    Matcher divContentMatcher = xhtmlParser.parse("/Product/Source/Publisher/PublisherName[@nameType='Person']");
    ContentHandler xhandler = new MatchingContentHandler(
            new ToXMLContentHandler(), divContentMatcher);
    AutoDetectParser parser = new AutoDetectParser();
    Metadata xmetadata = new Metadata();
    try (FileInputStream stream = new FileInputStream(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()))) {
        parser.parse(stream, xhandler, xmetadata);
        System.out.println(xhandler.toString());
    } catch (URISyntaxException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
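But it produces no output at all! I expected it to return only the nodes selected by the XPath query.
Any idea what is going on?
By the way, here is the corresponding XML: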
<Product productID="xvc22" shortProductID="x" language="en">
  <ProductStatus statusType="Published" />
  <Source>
    <Publisher sequence="1" primaryIndicator="Yes">
      <PublisherID idType="Shortname">jjkjkj</PublisherID>
      <PublisherID idType="BM">6666</PublisherID>
      <PublisherName nameType="Legal">ABT</PublisherName>
      <PublisherName nameType="Person">
        <LastName>pppp</LastName>
        <FirstName>lkkk</FirstName>
      </PublisherName>
    </Publisher>
  </Source>
</Product>
Evaluating the same expression with the standard Java XPath API:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(ParseXML.class.getClassLoader().getResource("xml/a.xml").toURI()));
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/Product/Source/Publisher/PublisherName[@nameType='Person']");
System.out.println(expr.evaluate(doc, XPathConstants.STRING));
This prints:
pppp
lkkk
That is exactly what I want. So why can't Tika evaluate the XPath query?

You seem to be asking Tika for the plain-text version of the document, so it is no wonder the tags are stripped. What happens if you ask Tika for the XHTML version of the document?
Thanks, please see the edit. Is that what you meant?
Please see the edit. I made some changes.
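
Since the plain javax.xml.xpath code in the question already selects the right element, one possible way to reach the stated goal (tags and their content in a map) is to evaluate the expression as a node instead of a string and walk its child elements. As far as I can tell, Tika's XPathParser implements only a very small XPath subset (element paths, text(), node(), attributes) without predicates such as [@nameType='Person'], and it matches against the XHTML SAX events Tika generates rather than the original XML element names; either point would explain the empty output, but treat that as an assumption to verify against the Tika Javadoc. The sketch below uses only the JDK. The names PublisherNameExtractor, extractChildren and SAMPLE_XML are hypothetical and exist only for illustration, and the XML is an inlined copy of the document from the question.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Tika or of the original question.
class PublisherNameExtractor {

    // Inlined copy of the XML from the question so the sketch is self-contained.
    static final String SAMPLE_XML =
          "<Product productID=\"xvc22\" shortProductID=\"x\" language=\"en\">"
        + "<ProductStatus statusType=\"Published\" />"
        + "<Source>"
        + "<Publisher sequence=\"1\" primaryIndicator=\"Yes\">"
        + "<PublisherID idType=\"Shortname\">jjkjkj</PublisherID>"
        + "<PublisherID idType=\"BM\">6666</PublisherID>"
        + "<PublisherName nameType=\"Legal\">ABT</PublisherName>"
        + "<PublisherName nameType=\"Person\">"
        + "<LastName>pppp</LastName>"
        + "<FirstName>lkkk</FirstName>"
        + "</PublisherName>"
        + "</Publisher>"
        + "</Source>"
        + "</Product>";

    // Returns child tag name -> trimmed text for the first node matching the expression.
    // An empty map is returned when nothing matches.
    static Map<String, String> extractChildren(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // NODE keeps the element itself; STRING would flatten it to text and lose the tags.
        Node match = (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);

        // LinkedHashMap is a HashMap subclass that preserves document order.
        Map<String, String> result = new LinkedHashMap<>();
        if (match == null) {
            return result;
        }
        NodeList children = match.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.put(child.getNodeName(), child.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> names = extractChildren(
                SAMPLE_XML, "/Product/Source/Publisher/PublisherName[@nameType='Person']");
        System.out.println(names); // prints {LastName=pppp, FirstName=lkkk}
    }
}
```

If several elements can match, evaluating with XPathConstants.NODESET and looping over the resulting NodeList would be the natural extension; a repeated child tag name would overwrite the earlier entry in this sketch.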