Hadoop JanusGraph: loading a CSV with ScriptInputFormat
I am trying to load a CSV file into JanusGraph. As far as I understand, I need to create my graph and its schema, then use the BulkLoaderVertexProgram with my own custom Groovy script to parse the CSV file. This appears to partly work: I can see the vertices, but no edges are created. My configuration looks almost identical to every CSV-loading example I can find, so there must be something I am misunderstanding or forgetting. Is it possible to bulk load edges from a CSV file?
Here is my setup:
gremlin> :load data/defineNCBIOSchema.groovy
==>true
gremlin> graph = JanusGraphFactory.open('conf/gremlin-server/socket-janusgraph-apr-test.properties')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> defineNCBIOSchema(graph)
==>null
gremlin> graph.close()
==>null
gremlin> graph = GraphFactory.open('conf/hadoop-graph/apr-test-hadoop-script.properties')
==>hadoopgraph[scriptinputformat->graphsonoutputformat]
gremlin> blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph('conf/gremlin-server/socket-janusgraph-apr-test.properties').create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader, vertexIdProperty=bulkLoader.vertex.id, userSuppliedIds=false, keepOriginalIds=true, batchSize=0]
gremlin> graph.compute(SparkGraphComputer).workers(1).program(blvp).submit().get()
==>result[hadoopgraph[scriptinputformat->graphsonoutputformat],memory[size:0]]
gremlin> graph.close()
==>null
gremlin> graph = GraphFactory.open('conf/hadoop-graph/apr-test-hadoop-load.properties')
==>hadoopgraph[cassandrainputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.E() <--- returns nothing
I start Cassandra with the default bin/janusgraph.sh script.
My graph for the bulk loader (conf/hadoop-graph/apr-test-hadoop-script.properties):
The read graph (conf/hadoop-graph/apr-test-hadoop-load.properties):
My Groovy script:
class Globals {
    static String[] h = []
    static int lineNumber = 0
}

def parse(line, factory) {
    def vertexType = 'Disease'
    def edgeLabel = 'parent'
    def parentsIndex = 2
    Globals.lineNumber++
    // split into columns, ignoring commas inside quoted fields
    def c = line.split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/)
    // if the first column is ClassID this is the header line: remember it and skip
    if (c[0] == /ClassID/) {
        Globals.h = c
        return null
    }
    def v1 = graph.addVertex(T.id, c[0], T.label, vertexType)
    for (int i = 0; i < c.length; ++i) {
        if (i != parentsIndex) { // the Parents column becomes edges, not a property
            def f = removeInvalidChar(c[i])
            if (f?.trim()) {
                v1.property(Globals.h[i], f)
            }
        }
    }
    def parents = []
    if (c.length > parentsIndex) {
        parents = c[parentsIndex].split(/\|/)
    }
    for (int i = 0; i < parents.size(); ++i) {
        def v2 = graph.addVertex(T.id, parents[i], T.label, vertexType)
        v1.addInEdge(edgeLabel, v2)
    }
    return v1
}

def removeInvalidChar(col) {
    def f = col.replaceAll(/^\"|\"$/, "") // strip surrounding quotes
    f = f.replaceAll(/\{/, /(/)           // replace { with (
    f = f.replaceAll(/\}/, /)/)           // replace } with )
    if (f == /label/) {                   // 'label' is reserved, so rename it
        f = /label2/
    }
    return f
}
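The trickiest part of the script is the quote-aware split. As a stand-alone illustration (sketched in Python here, since it needs no JanusGraph instance), the same regex splits on a comma only when an even number of double quotes remains to its right:

```python
import re

# Same pattern as in the Groovy parse(): split on a comma only when an
# even number of double quotes follows it (i.e. the comma is not quoted).
SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

header = SPLIT.split('ClassID,PreferredLabel,Parents')
row = SPLIT.split('Vertex1,"Label, with comma",Vertex2|Vertex3')

print(header)  # ['ClassID', 'PreferredLabel', 'Parents']
print(row)     # ['Vertex1', '"Label, with comma"', 'Vertex2|Vertex3']
```

A quoted field such as `"Label, with comma"` survives as a single column, which the script then cleans up with removeInvalidChar().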
The CSV file:
ClassID,PreferredLabel,Parents
Vertex3,Prefered Label 3,
Vertex2,Prefered Label 2,Vertex3
Vertex1,Prefered Label 1,Vertex2|Vertex3
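For this sample, the parse script should produce three vertices and three `parent` relations, one per pipe-separated entry in the Parents column. A minimal simulation of that row-to-edge mapping (plain Python, no JanusGraph required, with (child, parent) pairs standing in for the edges) makes the expected result explicit:

```python
# (ClassID, PreferredLabel, Parents) rows from the sample CSV
rows = [
    ('Vertex3', 'Prefered Label 3', ''),
    ('Vertex2', 'Prefered Label 2', 'Vertex3'),
    ('Vertex1', 'Prefered Label 1', 'Vertex2|Vertex3'),
]

vertices = sorted({vid for vid, _, _ in rows})
# one 'parent' relation per pipe-separated entry, as in the Groovy script
edges = [(vid, p)
         for vid, _, parents in rows if parents
         for p in parents.split('|')]

print(vertices)  # ['Vertex1', 'Vertex2', 'Vertex3']
print(edges)     # [('Vertex2', 'Vertex3'), ('Vertex1', 'Vertex2'), ('Vertex1', 'Vertex3')]
```

So if g.V() shows these three vertices but g.E() returns nothing after the load, only the edge half of this mapping is being lost.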
The schema definition script (data/defineNCBIOSchema.groovy):

def defineNCBIOSchema(graph) {
    mgmt = graph.openManagement()
    // vertex labels
    vertexLabel = mgmt.makeVertexLabel('Disease').make()
    // edge labels
    parent = mgmt.makeEdgeLabel('parent').multiplicity(MULTI).make()
    // vertex and edge properties
    blid = mgmt.makePropertyKey('bulkLoader.vertex.id').dataType(String.class).make()
    classID = mgmt.makePropertyKey('ClassID').dataType(String.class).cardinality(Cardinality.SINGLE).make()
    preferredLabel = mgmt.makePropertyKey('PreferredLabel').dataType(String.class).cardinality(Cardinality.SINGLE).make()
    // global indices -- note: without buildCompositeIndex() the index is never actually created
    mgmt.buildIndex('ClassIDIndex', Vertex.class).addKey(classID).unique().buildCompositeIndex()
    mgmt.commit()
}
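One cheap sanity check before a long bulk load is that every CSV column the script writes as a property has a declared property key (Parents is consumed as edges, not stored as a property). A hypothetical stand-alone check, with the key names taken from the schema script above:

```python
# property keys declared in defineNCBIOSchema()
declared_keys = {'bulkLoader.vertex.id', 'ClassID', 'PreferredLabel'}
header = 'ClassID,PreferredLabel,Parents'.split(',')
edge_columns = {'Parents'}  # turned into 'parent' edges, not properties

missing = [col for col in header
           if col not in edge_columns and col not in declared_keys]
print(missing)  # [] -> every property column has a declared key
```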