Hadoop JanusGraph: loading a CSV with ScriptInputFormat
I am trying to load a CSV file into JanusGraph. As far as I understand, I need to create my graph and its schema, then use the BulkLoaderVertexProgram with my own custom Groovy script to parse the CSV file. This appears to partly work: I can see the vertices, but no edges are created. My configuration looks almost identical to every CSV-loading example I can find, so there must be something I am misunderstanding or forgetting. Is it possible to bulk load edges from a CSV file?
Here is my setup:
gremlin> :load data/defineNCBIOSchema.groovy
==>true
gremlin> graph = JanusGraphFactory.open('conf/gremlin-server/socket-janusgraph-apr-test.properties')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> defineNCBIOSchema(graph)
==>null
gremlin> graph.close()
==>null
gremlin> graph = GraphFactory.open('conf/hadoop-graph/apr-test-hadoop-script.properties')
==>hadoopgraph[scriptinputformat->graphsonoutputformat]
gremlin> blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph('conf/gremlin-server/socket-janusgraph-apr-test.properties').create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader, vertexIdProperty=bulkLoader.vertex.id, userSuppliedIds=false, keepOriginalIds=true, batchSize=0]
gremlin> graph.compute(SparkGraphComputer).workers(1).program(blvp).submit().get()
==>result[hadoopgraph[scriptinputformat->graphsonoutputformat],memory[size:0]]
gremlin> graph.close()
==>null
gremlin> graph = GraphFactory.open('conf/hadoop-graph/apr-test-hadoop-load.properties')
==>hadoopgraph[cassandrainputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.E() <--- returns nothing
I start Cassandra with the default bin/janusgraph.sh script.
My graph for the bulk loader (conf/hadoop-graph/apr-test-hadoop-script.properties):
The read graph (conf/hadoop-graph/apr-test-hadoop-load.properties):
My Groovy script:
class Globals {
    static String[] h = []
    static int lineNumber = 0
}

def parse(line, factory) {
    def vertexType = 'Disease'
    def edgeLabel = 'parent'
    def parentsIndex = 2
    Globals.lineNumber++
    // split into columns, ignoring commas inside quoted fields
    def c = line.split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/)
    // if the first column is ClassID this is the header line: remember it and skip
    if (c[0] == /ClassID/) {
        Globals.h = c
        return null
    }
    def v1 = graph.addVertex(T.id, c[0], T.label, vertexType)
    for (int i = 0; i < c.length; ++i) {
        if (i != parentsIndex) { // the Parents column becomes edges, not a property
            def f = removeInvalidChar(c[i])
            if (f?.trim()) {
                v1.property(Globals.h[i], f)
            }
        }
    }
    def parents = []
    if (c.length > parentsIndex) {
        parents = c[parentsIndex].split(/\|/)
    }
    for (int i = 0; i < parents.size(); ++i) {
        def v2 = graph.addVertex(T.id, parents[i], T.label, vertexType)
        v1.addInEdge(edgeLabel, v2)
    }
    return v1
}

def removeInvalidChar(col) {
    def f = col.replaceAll(/^\"|\"$/, "") // strip surrounding quotes
    f = f.replaceAll(/\{/, /(/)           // replace { with (
    f = f.replaceAll(/\}/, /)/)           // replace } with )
    if (f == /label/) {                   // 'label' is reserved, so rename it
        f = /label2/
    }
    return f
}
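The trickiest part of the script is the quote-aware split. As a stand-alone illustration (sketched in Python here, since it needs no JanusGraph instance), the same regex splits on a comma only when an even number of double quotes remains to its right:

```python
import re

# Same pattern as in the Groovy parse(): split on a comma only when an
# even number of double quotes follows it (i.e. the comma is not quoted).
SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

header = SPLIT.split('ClassID,PreferredLabel,Parents')
row = SPLIT.split('Vertex1,"Label, with comma",Vertex2|Vertex3')

print(header)  # ['ClassID', 'PreferredLabel', 'Parents']
print(row)     # ['Vertex1', '"Label, with comma"', 'Vertex2|Vertex3']
```

A quoted field such as `"Label, with comma"` survives as a single column, which the script then cleans up with removeInvalidChar().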
The CSV file:
ClassID,PreferredLabel,Parents
Vertex3,Prefered Label 3,
Vertex2,Prefered Label 2,Vertex3
Vertex1,Prefered Label 1,Vertex2|Vertex3
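For this sample, the parse script should produce three vertices and three `parent` relations, one per pipe-separated entry in the Parents column. A minimal simulation of that row-to-edge mapping (plain Python, no JanusGraph required, with (child, parent) pairs standing in for the edges) makes the expected result explicit:

```python
# (ClassID, PreferredLabel, Parents) rows from the sample CSV
rows = [
    ('Vertex3', 'Prefered Label 3', ''),
    ('Vertex2', 'Prefered Label 2', 'Vertex3'),
    ('Vertex1', 'Prefered Label 1', 'Vertex2|Vertex3'),
]

vertices = sorted({vid for vid, _, _ in rows})
# one 'parent' relation per pipe-separated entry, as in the Groovy script
edges = [(vid, p)
         for vid, _, parents in rows if parents
         for p in parents.split('|')]

print(vertices)  # ['Vertex1', 'Vertex2', 'Vertex3']
print(edges)     # [('Vertex2', 'Vertex3'), ('Vertex1', 'Vertex2'), ('Vertex1', 'Vertex3')]
```

So if g.V() shows these three vertices but g.E() returns nothing after the load, only the edge half of this mapping is being lost.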
The schema definition script (data/defineNCBIOSchema.groovy):

def defineNCBIOSchema(graph) {
    mgmt = graph.openManagement()
    // vertex labels
    vertexLabel = mgmt.makeVertexLabel('Disease').make()
    // edge labels
    parent = mgmt.makeEdgeLabel('parent').multiplicity(MULTI).make()
    // vertex and edge properties
    blid = mgmt.makePropertyKey('bulkLoader.vertex.id').dataType(String.class).make()
    classID = mgmt.makePropertyKey('ClassID').dataType(String.class).cardinality(Cardinality.SINGLE).make()
    preferredLabel = mgmt.makePropertyKey('PreferredLabel').dataType(String.class).cardinality(Cardinality.SINGLE).make()
    // global indices -- note: without buildCompositeIndex() the index is never actually created
    mgmt.buildIndex('ClassIDIndex', Vertex.class).addKey(classID).unique().buildCompositeIndex()
    mgmt.commit()
}
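One cheap sanity check before a long bulk load is that every CSV column the script writes as a property has a declared property key (Parents is consumed as edges, not stored as a property). A hypothetical stand-alone check, with the key names taken from the schema script above:

```python
# property keys declared in defineNCBIOSchema()
declared_keys = {'bulkLoader.vertex.id', 'ClassID', 'PreferredLabel'}
header = 'ClassID,PreferredLabel,Parents'.split(',')
edge_columns = {'Parents'}  # turned into 'parent' edges, not properties

missing = [col for col in header
           if col not in edge_columns and col not in declared_keys]
print(missing)  # [] -> every property column has a declared key
```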