Graph: The simplest way to import a simple CSV file into a graph using OrientDB ETL


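I want to import a very simple CSV file describing a directed graph into OrientDB. Specifically, the file is the roadNet-PA dataset from the SNAP collection. The first lines of the file look like this: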

# Directed graph (each unordered pair of nodes is saved once)
# Pennsylvania road network
# Nodes: 1088092 Edges: 3083796
# FromNodeId    ToNodeId
0       1
0       6309
0       6353
1       0
6353    0
6353    6354
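There is only one type of vertex (a road intersection), and the edges carry no information (I think OrientDB lightweight edges are the best choice). Also note that the two node IDs on each line are separated by a tab.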
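
To make the file layout concrete, here is a minimal Python sketch that parses the sample lines shown above. It is independent of OrientDB; the `read_edges` helper and the `SAMPLE` string are illustrative names, not part of the dataset or of any library.

```python
import io

# The sample lines from the question; "\t" is the tab that separates the IDs.
SAMPLE = """\
# Directed graph (each unordered pair of nodes is saved once)
# Pennsylvania road network
# Nodes: 1088092 Edges: 3083796
# FromNodeId\tToNodeId
0\t1
0\t6309
0\t6353
1\t0
6353\t0
6353\t6354
"""


def read_edges(lines):
    """Yield (from_id, to_id) integer pairs, skipping blank and '#' lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split("\t")
        yield int(src), int(dst)


edges = list(read_edges(io.StringIO(SAMPLE)))
vertices = {v for edge in edges for v in edge}
print(len(edges), "edges,", len(vertices), "distinct vertices")
```

The same vertex ID appears in many rows, so any import has to look up an existing vertex before creating a new one.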

I tried to create a simple ETL configuration to import the file, but without success. Here is the ETL:

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roadNet-PA.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "   ", "skipFrom": 1, "skipTo": 4 } },
    { "vertex": { "class": "Intersection" } },
    { "edge": { "class": "Road" } }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 
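The ETL runs, but it does not import the file the way I expected. I suspect the problem is in the transformers. My idea is to read the CSV line by line and, for each line, create the two vertices and an edge connecting them, but I am not sure how to express that in the ETL file. Any ideas?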
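
The idea described above (for each row, look up or create both vertices, then connect them) can be sketched in plain Python. This is only an in-memory illustration under assumed names: `TinyGraph` is a hypothetical stand-in, not an OrientDB API, and it says nothing about how the ETL transformers behave internally.

```python
from collections import defaultdict


class TinyGraph:
    """In-memory stand-in for a graph database (hypothetical, not OrientDB)."""

    def __init__(self):
        # vertex id -> set of target vertex ids (edges carry no properties)
        self.out_edges = defaultdict(set)

    def get_or_create_vertex(self, vertex_id):
        # Accessing the key creates an empty adjacency set if it is missing,
        # so a vertex that appears in many rows is stored only once.
        self.out_edges[vertex_id]
        return vertex_id

    def add_edge(self, src, dst):
        self.get_or_create_vertex(src)
        self.get_or_create_vertex(dst)
        self.out_edges[src].add(dst)


# The six sample rows from the question as (from, to) pairs.
rows = [(0, 1), (0, 6309), (0, 6353), (1, 0), (6353, 0), (6353, 6354)]

graph = TinyGraph()
for src, dst in rows:
    graph.add_edge(src, dst)

print(sorted(graph.out_edges[0]))  # targets reachable from vertex 0
print(len(graph.out_edges))        # number of distinct vertices
```

The lookup-before-create step is the part that needs a unique index on the vertex key in a real database; without it, every row would create duplicate vertices.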

试试这个:

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roadNet-PA.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "\t", "skipFrom": 1, "skipTo": 4,
               "columnsOnFirstLine": false, 
               "columns":["id", "to"] } },
    { "vertex": { "class": "Intersection" } },
    { "merge": { "joinFieldName":"id", "lookup":"Intersection.id" } },
    { "edge": {
       "class": "Road",
       "joinFieldName": "to",
       "lookup": "Intersection.id",
       "unresolvedLinkAction": "CREATE"
      }
    }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "wal": false,
       "batchCommit": 1000,
       "tx": true,
       "txUseLog": false,
       "useLightweightEdges" : true,
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["id:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 
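To speed up loading, I suggest shutting down the server and running the ETL import with "plocal:" instead of "remote:". For example, replace the existing dbURL with: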

       "dbURL": "plocal:/orientdb/databases/roads",
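
If the configuration is kept as a JSON file, the switch can also be scripted. The following is a small sketch, assuming the config has already been loaded into a Python dict; `use_plocal` is a hypothetical helper name, and the database path is the one from the line above.

```python
import copy
import json


def use_plocal(config, db_path):
    """Return a copy of an ETL config whose loader points at a plocal database."""
    updated = copy.deepcopy(config)
    updated["loader"]["orientdb"]["dbURL"] = "plocal:" + db_path
    return updated


# Only the loader part of the config is shown here for brevity.
config = {"loader": {"orientdb": {"dbURL": "remote:localhost/roads", "dbType": "graph"}}}

local_config = use_plocal(config, "/orientdb/databases/roads")
print(json.dumps(local_config, indent=2))
```

Working on a copy keeps the original remote configuration available for later use against a running server.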

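It finally worked. As Luca suggested, I moved the merge before the vertex line. I also renamed the "id" field to "from" to avoid the error "Property key is reserved for all elements: id". Here is the snippet: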

{
  "config": {
    "log": "debug"
  },
  "source" : {
    "file": { "path": "/tmp/roads.csv" }
  },
  "extractor": { "row": {} },
  "transformers": [
    { "csv": { "separator": "\t",
               "columnsOnFirstLine": false, 
               "columns":["from", "to"] } },
    { "merge": { "joinFieldName":"from", "lookup":"Intersection.from" } },
    { "vertex": { "class": "Intersection" } },
    { "edge": {
       "class": "Road",
       "joinFieldName": "to",
       "lookup": "Intersection.from",
       "unresolvedLinkAction": "CREATE"
      }
    }
  ],
  "loader": {
    "orientdb": {
       "dbURL": "remote:localhost/roads",
       "dbType": "graph",
       "wal": false,
       "batchCommit": 1000,
       "tx": true,
       "txUseLog": false,
       "useLightweightEdges" : true,
       "classes": [
         {"name": "Intersection", "extends": "V"},
         {"name": "Road", "extends": "E"}
       ], "indexes": [
         {"class":"Intersection", "fields":["from:integer"], "type":"UNIQUE" }
       ]
    }
  }
} 

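Thanks for your answer. I am not sure whether I did something wrong, but I ran into two errors. First, the skipFrom and skipTo settings did not work: the first lines were still passed to the transformers. I removed those lines manually and then hit the second problem: OrientVertex cannot be cast to ODocument. Here is the log

Try moving the merge before the vertex.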
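
The comments above mention deleting the header lines by hand because skipFrom and skipTo had no effect. For a file with millions of rows, that step can be scripted instead. This is a minimal sketch: `strip_comment_lines` is a hypothetical helper, and the demonstration runs on a small temporary file rather than the real dataset.

```python
import os
import tempfile


def strip_comment_lines(src_path, dst_path, prefix="#"):
    """Copy src_path to dst_path, dropping lines that start with prefix.

    Returns the number of lines written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.startswith(prefix):
                continue
            dst.write(line)
            written += 1
    return written


# Small demonstration on a temporary file instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "roadNet-PA.csv")
    clean = os.path.join(tmp, "roads.csv")
    with open(raw, "w", encoding="utf-8") as f:
        f.write("# Pennsylvania road network\n# FromNodeId\tToNodeId\n0\t1\n0\t6309\n")
    count = strip_comment_lines(raw, clean)
    with open(clean, encoding="utf-8") as f:
        cleaned_text = f.read()

print(count)
```

The function streams the file line by line, so it does not need to hold the whole edge list in memory.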