Bulk loading into OrientDB 2.0.0 with the Java API is CPU-bound


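I am testing how OrientDB 2.0.0 handles bulk data loading. For sample data I am using the GDELT dataset from Google's GDELT project (a free download). I am using the Java API to load about 80M vertices in total, each with 8 properties, into the V class of an empty graph database.

The data is in a single tab-delimited text file (US-ASCII), so I simply read the file from top to bottom. I configured the database with OIntentMassiveInsert() and set the transaction size to 25,000 records per commit.

I am using an 8-core machine with 32 GB of RAM and an SSD, so hardware should not be a factor. I am running Windows 7 Pro with Java 8r31.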

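The first 20 million (or so) records went in quite quickly, at under 2 seconds per batch of 25,000 records. I was encouraged.

However, as the process kept running, the insert rate dropped significantly. The slowdown appears to be fairly linear. Here are some sample lines from my output log: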

Committed 25000 GDELT Event records to OrientDB in 4.09989189 seconds at a rate of 6097 records per second. Total = 31350000
Committed 25000 GDELT Event records to OrientDB in 9.42005182 seconds at a rate of 2653 records per second. Total = 40000000
Committed 25000 GDELT Event records to OrientDB in 15.883908716 seconds at a rate of 1573 records per second. Total = 45000000
Committed 25000 GDELT Event records to OrientDB in 45.814514946 seconds at a rate of 545 records per second. Total = 50000000

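As the operation proceeds, memory usage has stayed constant, but OrientDB's CPU usage has climbed higher and higher, in step with the batch durations. At the start, the OrientDB Java process was using about 5% CPU. It is now up to about 90%, with the load spread evenly across all 8 cores.

Should I break the load operation into several sequential connections, or is this a function of how vertex data is managed internally, so that it would make no difference if I stopped the process and resumed inserting where I left off?

Thanks.

[Update] The process eventually died with the following error: java.lang.OutOfMemoryError: GC overhead limit exceeded

All commits were processed successfully, and I ended up with a little over 51 million records. I will restructure the loader to break the 1 huge file into many smaller files (for example, 1M records each) and treat each file as a separate load.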
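The file-splitting step described above could be sketched as follows. This is only an illustration using the Java standard library, not the loader that was actually written: the class name FileSplitter, the chunk-NNNNN.tsv naming scheme, and the output directory are all assumptions. It relies on the input being US-ASCII, as stated in the question, and will fail on bytes outside that charset.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one large line-oriented file into smaller chunk files.
final class FileSplitter {

    // Writes at most linesPerChunk lines into each chunk file under outDir
    // and returns the chunk paths in the order they were written.
    static List<Path> split(Path input, Path outDir, int linesPerChunk) throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        Files.createDirectories(outDir);
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.US_ASCII)) {
            BufferedWriter writer = null;
            try {
                String line;
                int linesInChunk = 0;
                while ((line = reader.readLine()) != null) {
                    // Start a new chunk for the first line and whenever the current one is full.
                    if (writer == null || linesInChunk == linesPerChunk) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outDir.resolve(String.format("chunk-%05d.tsv", chunks.size()));
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.US_ASCII);
                        chunks.add(chunk);
                        linesInChunk = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    linesInChunk++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return chunks;
    }
}
```

Each returned path can then be fed to a separate load run, for example with linesPerChunk set to 1,000,000 to match the figure mentioned above.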

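Once that is done, I will try to take the flat list of vertices and add some edges. Any suggestions on how to do that in a bulk-insert context where vertex IDs have not yet been assigned? Thanks.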

[Update 2] I am using the Graph API. The code is as follows:

// Open the OrientDB database instance
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraph txGraph = factory.getTx();

// Iterate row by row over the file.
while ((line = reader.readLine()) != null) {
    fields = line.split("\t");
    try {
        Vertex v = txGraph.addVertex(null); // 1st OPERATION: IMPLICITLY BEGIN A TRANSACTION
        for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
            v.setProperty(headerFieldsReduced[i], fields[i]);
        }

        // Commit every so often to balance performance and transaction size
        if (++counter % commitPoint == 0) {
            txGraph.commit();
        }

    } catch( Exception e ) {
        txGraph.rollback();
    }
}



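[Update 3 - 2015-02-08] Problem solved.
Had I read the documentation more carefully, I would have seen that using transactions for a bulk load is the wrong strategy. I switched to the "NoTx" graph and to setting the vertex properties in bulk, and it worked like a champ, with no slowdown over time and no CPU pegging.
Starting with 52M vertices already in the database, I added 19M more vertices in 22 minutes at a rate of over 14,000 vertices per second, each vertex having 16 properties.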
// Open the OrientDB database instance (non-transactional graph)
OrientGraphFactory factory = new OrientGraphFactory("remote:localhost/gdelt", "admin", "admin");
factory.declareIntent(new OIntentMassiveInsert());
OrientGraphNoTx graph = factory.getNoTx();

// For each row: collect the properties in a map, then set them all in one call
Map<String,Object> props = new HashMap<String,Object>();
OrientVertex v = graph.addVertex(null);
for (i = 0; i < headerFieldsReduced.length && i < fields.length; i++) {
    props.put(headerFieldsReduced[i], fields[i]);
}
v.setProperties(props);
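The property-map step in the snippet above could be isolated into a small helper. This is a sketch under assumptions rather than the code that was actually run: the class and method names are hypothetical, and it deliberately builds a fresh map per row. If a single map were reused across rows, a row with fewer fields than the previous one would carry over stale values. Whether the original loop did that is not visible in the snippet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: pairs header names with the fields of one tab-separated row.
final class RowProperties {

    // Builds a new property map for a single row. Columns beyond the shorter
    // of the two arrays are ignored, mirroring the loop bounds in the snippet.
    static Map<String, Object> toProperties(String[] header, String[] fields) {
        Map<String, Object> props = new LinkedHashMap<>();
        int n = Math.min(header.length, fields.length);
        for (int i = 0; i < n; i++) {
            props.put(header[i], fields[i]);
        }
        return props;
    }
}
```

The result could then be passed to v.setProperties(...) in the same way the snippet passes props.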

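Which API are you using? Can you share a piece of the code?

Thanks, Lvca. I have updated the post with the relevant section of code.

Problem solved. I updated the original post with [Update 3] at the bottom, describing the solution. In essence, do not use transactions for bulk loading, just as the documentation says. Thanks.

You should try v2.2 (beta) and create a multi-threaded importer. That way you can go even faster.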
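The multi-threaded importer suggested in the last comment might be structured roughly as below. This is a sketch of the threading pattern only, using the Java standard library. RecordSink and SinkFactory are hypothetical interfaces introduced for illustration and are not part of OrientDB. The assumption is that each worker opens its own sink (in practice, presumably its own non-transactional graph instance from the factory) instead of sharing one across threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: imports chunks of tab-separated lines on a fixed thread pool.
final class ParallelImporter {

    // Hypothetical per-worker destination for parsed rows.
    interface RecordSink extends AutoCloseable {
        void insert(String[] fields) throws Exception;
    }

    // Hypothetical factory that opens one sink per task.
    interface SinkFactory {
        RecordSink open() throws Exception;
    }

    // Submits one task per chunk and returns the total number of rows inserted.
    static long importAll(List<List<String>> chunks, int threads, SinkFactory sinks)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (List<String> chunk : chunks) {
                Callable<Long> task = () -> {
                    long inserted = 0;
                    // Each task uses its own sink, so no sink is shared between threads.
                    try (RecordSink sink = sinks.open()) {
                        for (String line : chunk) {
                            sink.insert(line.split("\t"));
                            inserted++;
                        }
                    }
                    return inserted;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // rethrows any worker failure as ExecutionException
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Holding chunks in memory as lists keeps the sketch short. For a file of this size, each task would more plausibly read its own chunk file from disk.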