
Best way to import bulk data into ArangoDB

graph arangodb

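I am currently working on an ArangoDB POC. I find that creating documents with ArangoDB and PyArango is very slow: inserting 300 documents takes about 5 minutes. I have pasted rough code below; please let me know whether there is a better way to speed this up: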

with open('abc.csv') as fp:
    for line in fp:
        # strip the trailing newline before splitting the row
        dataList = line.strip().split(",")

        aaa = dbObj['aaa'].createDocument()
        bbb = dbObj['bbb'].createDocument()
        ccc = dbObj['ccc'].createEdge()

        bbb['bbb'] = dataList[1]
        aaa['aaa'] = dataList[0]
        aaa._key = dataList[0]

        # one HTTP request per vertex
        aaa.save()
        bbb.save()

        ccc.links(aaa, bbb)
        ccc['related_to'] = "gfdgf"
        ccc['weight'] = 0

        # and a third request for the edge
        ccc.save()

The different collections are created with the following code:

 dbObj.createCollection(className='aaa', waitForSync=False)

I would build up all the data to be inserted as a string in JSON format and use createDocumentRaw to create all of it with a single save.
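
One way to picture "build everything as JSON and send it once" is sketched below. It does not use PyArango; it assumes the server's HTTP bulk import endpoint (`POST /_api/import` with `type=documents`, one JSON document per line) is called directly. The function names `build_import_body` and `post_import`, the `base_url` parameter, and the lack of authentication are all assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_import_body(docs):
    """Serialize the documents as one JSON object per line (UTF-8 bytes)."""
    return "\n".join(json.dumps(doc) for doc in docs).encode("utf-8")


def post_import(base_url, collection, docs):
    """Send all documents to the bulk import endpoint in a single request.

    Assumption: base_url points at an unauthenticated server, for example
    "http://localhost:8529" or ".../_db/<database>"; add auth headers if needed.
    """
    query = urllib.parse.urlencode({"collection": collection, "type": "documents"})
    request = urllib.request.Request(
        f"{base_url}/_api/import?{query}",
        data=build_import_body(docs),
        method="POST",
    )
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request) as response:
        # the server answers with a JSON summary of the import
        return json.load(response)
```

With this shape, the 300 rows become three requests (one per collection) instead of roughly 900. Edge documents would go to the edge collection the same way, provided each one carries `_from` and `_to`.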

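Regarding the question about batch mode in the arango java driver: if you know the key attributes of the vertices, you can build the document handle as "collectionName" + "/" + "documentKey". For example: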

arangoDriver.startBatchMode();

for (String line : lines)
{
  String[] data = line.split(",");

  BaseDocument device = new BaseDocument();
  BaseDocument phyAddress = new BaseDocument();
  BaseDocument conn = new BaseDocument();

  // build the handles by hand: collection name + "/" + document key
  String keyDevice = data[0];
  String handleDevice = "DeviceId/" + keyDevice;

  device.setDocumentKey(keyDevice);
  device.addAttribute("device_id", data[0]);

  String keyPhyAddress = data[1];
  String handlePhyAddress = "PhysicalLocation/" + keyPhyAddress;

  phyAddress.setDocumentKey(keyPhyAddress);
  phyAddress.addAttribute("address", data[1]);

  final DocumentEntity<BaseDocument> from = arangoDriver.graphCreateVertex("testGraph", "DeviceId", device, null);
  final DocumentEntity<BaseDocument> to = arangoDriver.graphCreateVertex("testGraph", "PhysicalLocation", phyAddress, null);

  // the edge uses the hand-built handles rather than from/to
  arangoDriver.graphCreateEdge("testGraph", "DeviceId_PhysicalLocation", null, handleDevice, handlePhyAddress, null, null);
}
arangoDriver.executeBatch();
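
Have you considered arangoimp? You can use Python to preprocess the source data if needed, but the import itself should be done by arangoimp, which uses ArangoDB's bulk import API for efficiency.

I looked at arangoimp. It seems the data has to be in the form of JSON or CSV files, with primary keys already present. The problem I see is that I would not be able to handle duplicate nodes, which may already exist in the database. Are there any specific options for handling these cases? I could not find any in the arangoimp documentation.

PyArango currently has a problem with HTTP keep-alive: it does not reuse connections, so it ends up redoing DNS lookups. We are currently investigating what causes this and how to fix it.

@dothebart: Thanks for the information. I recently tried ArangoDB's Java driver with batch mode enabled. It transferred data at about 1000 documents/second, but the edge collection was not updated. With batch mode removed, the same code updated the edge collection. Please let me know if you have any thoughts on this.

@pjesudhas: arangoimp supports JSON, CSV, and TSV. Document keys do not need to be present; if you do not supply one, ArangoDB generates a key for you. Of course, edges need the _from and _to attributes to tell it which documents to link. This is done using the document id (_id, collection name + "/" + document key). The documents to be linked do not actually have to exist; only the collections must exist. For how to handle duplicate documents, see here: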
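
The preprocessing step suggested in the comments could look roughly like the following sketch. It assumes the two CSV columns are valid document keys and uses them as `_key` for both vertex collections (the original question only sets a key on `aaa`), so that repeated values collapse into a single vertex. The function names `csv_to_import_records` and `write_jsonl` are hypothetical.

```python
import csv
import json


def csv_to_import_records(lines, vertex_a="aaa", vertex_b="bbb"):
    """Turn two-column CSV rows into vertex and edge records.

    Assumption: column values are usable as document keys. Vertices that
    repeat across rows are kept only once; edges are emitted for every row.
    """
    a_docs, b_docs, edges = {}, {}, []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        a_key, b_key = row[0].strip(), row[1].strip()
        a_docs.setdefault(a_key, {"_key": a_key, "aaa": a_key})
        b_docs.setdefault(b_key, {"_key": b_key, "bbb": b_key})
        edges.append({
            # document id = collection name + "/" + document key
            "_from": f"{vertex_a}/{a_key}",
            "_to": f"{vertex_b}/{b_key}",
            "related_to": "gfdgf",
            "weight": 0,
        })
    return list(a_docs.values()), list(b_docs.values()), edges


def write_jsonl(path, records):
    """Write one JSON document per line, a layout arangoimp can read."""
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record) + "\n")
```

Each resulting file would then be passed to arangoimp for its collection, with something like `--file aaa.jsonl --collection aaa`. Because every vertex now has a `_key`, arangoimp's `--on-duplicate` option (for example `ignore` or `update`) can decide what happens when a vertex already exists in the database. The exact flag spellings and the `--type` value for line-wise JSON vary between versions, so check them against the installed tool.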