XQuery Java performance with large XML files

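I am writing some Java code (using Saxon) that executes a simple XQuery file against a large XML file.

The XML file (located at this.referenceDataPath) has 3 million "row" nodes of the following form: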

<row>
<ISRC_NUMBER>1234567890</ISRC_NUMBER>
</row>
<row>
<ISRC_NUMBER>1234567891</ISRC_NUMBER>
</row>
<row>
<ISRC_NUMBER>1234567892</ISRC_NUMBER>
</row>

The Java code is:

private XQItem referenceDataItem;
private XQPreparedExpression xPrepExec;
private XQConnection conn;

//set connection string and xquery file
this.conn = new SaxonXQDataSource().getConnection();
InputStream queryFromFile = new FileInputStream(this.xqueryPath);

//Set the prepared expression 
InputStream is  = new FileInputStream(this.referenceDataPath);
this.referenceDataItem = conn.createItemFromDocument(is, null, null);
this.xPrepExec = conn.prepareExpression(queryFromFile);
xPrepExec.bindItem(new QName("refDocument"), this.referenceDataItem);   

//the code below is in a separate method and is called multiple times
public int getCount(String searchVal) throws XQException {

    //rebind only the search value; the reference document stays bound
    xPrepExec.bindString(new QName("isrc"), searchVal, conn.createAtomicType(XQItemType.XQBASETYPE_STRING));

    XQSequence resultsFromFile = xPrepExec.executeQuery();
    int count = Integer.parseInt(resultsFromFile.getSequenceAsString(new Properties()));
    return count;

}

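The getCount method is called many times in succession (for example 1,000,000 times) to verify that many values exist in the XML file.

Each call to getCount currently takes about 500 ms, which seems very slow given that the XML document is in memory and the query is prepared.

I am using XQuery as a proof of concept for future work in which the XML files will have a more complex layout.

I am running the code on an i7 with 8 GB of RAM, so memory is not an issue. I have also increased the heap size allocated to the program.

Any suggestions on how to improve the speed of this code?

Thanks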

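Zorba has features for parsing and querying large XML documents, and there is some documentation for them.

For example, in the following snippet we parse a 700 MB document over HTTP, and the whole process runs in a top-down streaming fashion: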

import module namespace http = "http://expath.org/ns/http-client";
import module namespace p = "http://www.zorba-xquery.com/modules/xml";
import schema namespace opt = "http://www.zorba-xquery.com/modules/xml-options";

let $raw-data as xs:string := http:send-request(<http:request href="http://cf.zorba-xquery.com.s3.amazonaws.com/forecasts.xml" method="GET" override-media-type="text/plain" />)[2]
let $data := p:parse($raw-data, <opt:options><opt:parse-external-parsed-entity opt:skip-root-nodes="1"/></opt:options>)
return
    subsequence($data, 1, 2)

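You can try this example online.

On the question of how to improve the speed, the most obvious answer is to try Saxon-EE, which has a more powerful optimizer and also uses bytecode generation. I haven't tried it, but I think Saxon-EE would detect that this query benefits from building an index, and would reuse the same index for every query.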
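To make the index idea concrete, here is a minimal sketch of what "build the index once, reuse it for every lookup" could look like when done by hand in plain Java. It is not what Saxon-EE does internally and it does not use XQuery at all; it only illustrates why an index turns each lookup from a scan of 3 million rows into a hash lookup. The class name IsrcIndex and the wrapping rows root element are assumptions made for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: a hand-built index over ISRC_NUMBER values.
public class IsrcIndex {
    // Maps each ISRC_NUMBER value to the number of times it occurs.
    private final Map<String, Integer> counts = new HashMap<>();

    // Streams through the document once (StAX) and fills the index.
    public IsrcIndex(Reader xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "ISRC_NUMBER".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    counts.merge(reader.getElementText().trim(), 1, Integer::sum);
                }
            }
        } finally {
            reader.close();
        }
    }

    // Constant-time lookup; returns 0 when the value is absent.
    public int getCount(String searchVal) {
        return counts.getOrDefault(searchVal, 0);
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny stand-in for the real file; the <rows> root is an assumption.
        String xml = "<rows>"
                + "<row><ISRC_NUMBER>1234567890</ISRC_NUMBER></row>"
                + "<row><ISRC_NUMBER>1234567891</ISRC_NUMBER></row>"
                + "<row><ISRC_NUMBER>1234567890</ISRC_NUMBER></row>"
                + "</rows>";
        IsrcIndex index = new IsrcIndex(new StringReader(xml));
        System.out.println(index.getCount("1234567890"));
        System.out.println(index.getCount("0000000000"));
    }
}
```

For the real file, the constructor would be given a FileReader or similar over this.referenceDataPath, and the index would be built once before the million getCount calls. This gives up the generality of XQuery, so it only fits while the lookup stays this simple.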

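The other suggestion I would make is to declare the type of the variable $refDocument; type information helps the optimizer make better-informed decisions. For example, if the optimizer knows that $refDocument is a single node, then it knows that $refDocument//X will automatically be in document order, without needing any sort operation.

It is also worth trying to replace the "=" operator with "eq".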
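The question does not show the query file itself, so the following is only a guess at how these two suggestions might look once applied. The query text, the ISRC_NUMBER path and the class name TypedQuery are assumptions; only the variable names refDocument and isrc come from the Java code in the question. The query is held in a Java string so that it could be handed to the String overload of XQConnection.prepareExpression in place of the file stream.

```java
// Hypothetical holder for the query text; the asker's real .xq file is not shown.
public class TypedQuery {
    public static final String QUERY = String.join("\n",
            // Declaring the types tells the optimizer each variable is a single item.
            "declare variable $refDocument as document-node() external;",
            "declare variable $isrc as xs:string external;",
            // 'eq' is the value comparison: it compares one value with one value,
            // whereas '=' is a general comparison over sequences.
            "count($refDocument//ISRC_NUMBER[. eq $isrc])");

    public static void main(String[] args) {
        System.out.println(QUERY);
    }
}
```

Whether this changes the timing would have to be measured; the declarations mainly remove guesswork for the optimizer rather than guarantee a particular speedup.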

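Thanks! Does Zorba have a Java API for streaming in an XML file from a local file rather than over HTTP?

Zorba has a Java API, but you can parse local files directly from XQuery. We did exactly that against the wikibooks dataset.

It is not clear why streaming should be an advantage for this use case, where many queries are run against a single document that can be held in memory.

In the end I decided to switch to an in-memory database, which performed better.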
