Java: XPathExpression to select from tesseract hOCR XML output


I have a file shaped roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta name='ocr-system' content='tesseract 3.02' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "D:\DPC2\converted\60\60.tiff"; bbox 0 0 2479 3508; ppageno 0'>
       <!-- LOTS OF CONTENT -->
  </div>
 </body>
</html>

I then use JDOM 2.x with the following XPath query:

//htmlFile is an input variable of type java.nio.Path
Document document = xmlBuilder.build(htmlFile.toFile());

XPathFactory factory = XPathFactory.instance();
XPathExpression<Element> xpePages = 
    factory.compile("//html/body/div[@class='ocr_page']", Filters.element());
List<Element> pages = xpePages.evaluate(document);
But it never finds any elements. What am I doing wrong in the query?

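The namespace declaration xmlns="http://www.w3.org/1999/xhtml" means that the unprefixed elements in the XML file are actually in the http://www.w3.org/1999/xhtml namespace, and you need to specify that in your XPath expression using a prefix: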

XPathExpression<Element> xpePages = 
    factory.compile("/h:html/h:body/h:div[@class='ocr_page']",
                    Filters.element(),
                    null, // no variables
                    Namespace.getNamespace("h", "http://www.w3.org/1999/xhtml"));
You must use a prefix, because in XPath an unprefixed name always means "no namespace".
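To see the "no prefix means no namespace" rule in isolation, here is a minimal sketch that uses only the JDK's javax.xml.xpath API rather than JDOM, so that it runs without extra libraries. The class name NamespaceXPathDemo, the helper count, and the tiny SAMPLE document are assumptions made up for illustration; they are not part of the original question.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceXPathDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the hOCR file (no DOCTYPE, so no DTD is fetched).
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
        + "<div class='ocr_page' id='page_1'/>"
        + "</body></html>";

    // Returns how many nodes the expression selects in SAMPLE.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // DOM parsing ignores namespaces unless asked
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(SAMPLE)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "h" to the XHTML namespace for use in expressions.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("h").iterator()
                    : Collections.<String>emptyIterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names look for elements in no namespace.
        System.out.println(count("/html/body/div[@class='ocr_page']"));
        // Prefixed names look in the XHTML namespace.
        System.out.println(count("/h:html/h:body/h:div[@class='ocr_page']"));
    }
}
```

The unprefixed query selects nothing because every element in the sample lives in the XHTML namespace, while the prefixed query selects the single ocr_page div. The attribute test @class needs no prefix, since unprefixed attributes are in no namespace.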

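The four-argument compile call is from the JDOM XPathFactory, not the javax.xml.xpath one; namespaces are much easier to handle in JDOM.

The root element declares a default namespace:

<html xmlns="http://www.w3.org/1999/xhtml"

so a query that avoids prefixes has to test both the local name and the namespace URI at each step:

 //*[local-name()='html' and namespace-uri()='http://www.w3.org/1999/xhtml']
 /*[local-name()='body' and namespace-uri()='http://www.w3.org/1999/xhtml']
 /* ... etc.

If you are sure that the element names do not clash across namespaces, you can choose to use only local-name():

//*[local-name()='html']/*[local-name()='body']/* ...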
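The trade-off between the two prefix-free forms can be sketched with the JDK's javax.xml.xpath API (used here instead of JDOM only to keep the example dependency-free). The class LocalNameXPathDemo, the helper count, and the sample document with an extra o:div in a made-up urn:example:other namespace are all assumptions for illustration. Note that XPath 1.0 compares with a single `=`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocalNameXPathDemo {

    // Hypothetical sample: one XHTML div plus a same-named element in another namespace.
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml' xmlns:o='urn:example:other'><body>"
        + "<div class='ocr_page' id='page_1'/>"
        + "<o:div class='ocr_page' id='other'/>"
        + "</body></html>";

    // Matches on local names only, so the namespace of each element is ignored.
    static final String LOCAL_ONLY =
        "//*[local-name()='html']/*[local-name()='body']"
        + "/*[local-name()='div'][@class='ocr_page']";

    // Also checks namespace-uri() on the last step, so only XHTML divs match.
    static final String WITH_URI =
        "//*[local-name()='html']/*[local-name()='body']"
        + "/*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']"
        + "[@class='ocr_page']";

    // Returns how many nodes the expression selects in SAMPLE.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so local names and URIs are populated
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(SAMPLE)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(LOCAL_ONLY)); // both div elements match
        System.out.println(count(WITH_URI));   // only the XHTML div matches
    }
}
```

In this sample the local-name-only query also picks up the o:div element, which is the kind of clash the caveat above refers to; adding the namespace-uri() test narrows the result back to the XHTML element.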