Hadoop Giraph: best vertex input format for an input file with String IDs

I have a multi-node Giraph cluster working properly on my PC. I ran the SimpleShortestPathExample from Giraph and it executed fine.

The algorithm was run with the following file, tiny_graph.txt:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
This file has the following input format:

[source_id,source_value,[[dest_id, edge_value],...]]
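Now I am trying to run the same algorithm on the same cluster, but with an input file that differs from the original. My own file looks like this: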

[Portada,0,[[Sugerencias para la cita del día,1]]]
[Proverbios españoles,0,[]]
[Neil Armstrong,0,[[Luna,1][ideal,1][verdad,1][Categoria:Ingenieros,2,[Categoria:Estadounidenses,2][Categoria:Astronautas,2]]]
[Categoria:Ingenieros,1,[[Neil Armstrong,2]]]
[Categoria:Estadounidenses,1,[[Neil Armstrong,2]]]
[Categoria:Astronautas,1,[[Neil Armstrong,2]]]
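It is very similar to the original, but the IDs are Strings and the vertex and edge values are long. My question is which TextInputFormat I should use, since I have already tried org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat and org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat and could not get either one to work.

Once this is solved, I can adapt the original shortest-path example algorithm to work with my file, but I cannot get to that point until I find a solution.

If this file format is not a good choice, I could change it, but I don't know which option is best for me. My knowledge of the text input and output formats in Giraph is very poor, which is why I am asking for advice here.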

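It is best to write your own input format. I suggest using the hash code of the strings as IDs. I wrote some sample code in which each line consists of: [vertex ID (an integer, e.g. the hash code of the string), vertex value (long), [[neighbor ID (integer), edge value (long)], ...]]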

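// Package names below assume a Giraph 1.x layout; adjust them to your version.
import java.io.IOException;
import java.util.List;

import com.google.common.collect.Lists;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.EdgeFactory;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.json.JSONArray;
import org.json.JSONException;

// Reads lines of the form [vertexId, vertexValue, [[neighborId, edgeValue], ...]].
// String IDs are mapped to int with String.hashCode(), so distinct IDs can collide.
public class JsonIntLongIntLongVertexInputFormat extends
  TextVertexInputFormat<IntWritable, LongWritable, LongWritable> {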

  @Override
  public TextVertexReader createVertexReader(InputSplit split,
      TaskAttemptContext context) {
    return new JsonIntLongIntLongVertexReader();
  }


  class JsonIntLongIntLongVertexReader extends
    TextVertexReaderFromEachLineProcessedHandlingExceptions<JSONArray,
    JSONException> {

    @Override
    protected JSONArray preprocessLine(Text line) throws JSONException {
      return new JSONArray(line.toString());
    }

    @Override
    protected IntWritable getId(JSONArray jsonVertex) throws JSONException,
              IOException {
      return new IntWritable(jsonVertex.getString(0).hashCode());
    }

    @Override
    protected LongWritable getValue(JSONArray jsonVertex) throws
      JSONException, IOException {
      return new LongWritable(jsonVertex.getLong(1));
    }

    @Override
    protected Iterable<Edge<IntWritable, LongWritable>> getEdges(
        JSONArray jsonVertex) throws JSONException, IOException {
      JSONArray jsonEdgeArray = jsonVertex.getJSONArray(2);
      List<Edge<IntWritable, LongWritable>> edges =
          Lists.newArrayListWithCapacity(jsonEdgeArray.length());
      for (int i = 0; i < jsonEdgeArray.length(); ++i) {
        JSONArray jsonEdge = jsonEdgeArray.getJSONArray(i);
        edges.add(EdgeFactory.create(new IntWritable(jsonEdge.getString(0).hashCode()),
            new LongWritable(jsonEdge.getLong(1))));
      }
      return edges;
    }

    @Override
    protected Vertex<IntWritable, LongWritable, LongWritable>
    handleException(Text line, JSONArray jsonVertex, JSONException e) {
      throw new IllegalArgumentException(
          "Couldn't get vertex from line " + line, e);
    }

  }
}
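
One caveat about using hash codes as IDs: String.hashCode() is not unique, so two different vertex names can end up with the same int ID and be treated as one vertex. The following is a minimal sketch, independent of Giraph, for checking a list of IDs before loading them. The class and method names (HashIdCollisionCheck, findCollisions) are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIdCollisionCheck {

  // Groups ids by String.hashCode() and returns only the groups
  // that contain more than one distinct id (i.e. the collisions).
  static List<List<String>> findCollisions(List<String> ids) {
    Map<Integer, List<String>> byHash = new HashMap<>();
    for (String id : ids) {
      List<String> bucket =
          byHash.computeIfAbsent(id.hashCode(), k -> new ArrayList<>());
      if (!bucket.contains(id)) {
        bucket.add(id);
      }
    }
    List<List<String>> collisions = new ArrayList<>();
    for (List<String> bucket : byHash.values()) {
      if (bucket.size() > 1) {
        collisions.add(bucket);
      }
    }
    return collisions;
  }

  public static void main(String[] args) {
    // "Aa" and "BB" are a well-known pair with the same hash code (2112).
    List<String> ids = Arrays.asList("Portada", "Luna", "Aa", "BB");
    System.out.println(findCollisions(ids));
  }
}
```

If such a check reports collisions for your data, keeping the IDs as text (for example with a Text-keyed input format) avoids the problem.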

I solved this by adapting my own file to fit org.apache.giraph.io.formats.TextDoubleDoubleAdjacencyListVertexInputFormat. My original file had to be rewritten to look like this:

Portada 0.0     Sugerencias     1.0
Proverbios      0.0
Neil    0.0     Luna    1.0     ideal   1.0     verdad  1.0     Categoria:Ingenieros    2.0     Categoria:Estadounidenses       2.0     Categoria:Astronautas   2.0
Categoria:Ingenieros    1.0     Neil    2.0
Categoria:Estadounidenses       1.0     Neil    2.0
Categoria:Astronautas   1.0     Neil    2.0
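The gaps between the fields are tab characters ("\t"), because this format uses the tab as its predefined token for splitting each line into separate strings.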
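
To make the line layout concrete, here is a minimal sketch of how one tab-separated line breaks down into a vertex ID, a vertex value, and (neighbor, edge value) pairs. It does not use Giraph and is not Giraph's own parsing code; the names AdjacencyLineSketch, ParsedVertex, and parse are hypothetical and exist only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AdjacencyLineSketch {

  // Simple holder for one parsed line.
  static final class ParsedVertex {
    final String id;
    final double value;
    // Neighbor id -> edge value, in file order.
    final Map<String, Double> edges = new LinkedHashMap<>();

    ParsedVertex(String id, double value) {
      this.id = id;
      this.value = value;
    }
  }

  // Expects: id <TAB> value [<TAB> neighborId <TAB> edgeValue]...
  static ParsedVertex parse(String line) {
    String[] tokens = line.split("\t");
    if (tokens.length < 2 || tokens.length % 2 != 0) {
      throw new IllegalArgumentException("Malformed line: " + line);
    }
    ParsedVertex vertex =
        new ParsedVertex(tokens[0], Double.parseDouble(tokens[1]));
    for (int i = 2; i < tokens.length; i += 2) {
      vertex.edges.put(tokens[i], Double.parseDouble(tokens[i + 1]));
    }
    return vertex;
  }

  public static void main(String[] args) {
    ParsedVertex v = parse("Neil\t0.0\tLuna\t1.0\tideal\t1.0");
    System.out.println(v.id + " " + v.value + " " + v.edges);
  }
}
```

Because this sketch splits only on tabs, a field may itself contain spaces; mixing tabs and spaces as separators would produce the wrong number of tokens.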


Anyway, thanks @masoud sagharichian for your help!! :D

Thanks again for answering one of my questions, masoud. In the end I edited my own data to fit a Giraph format, but your answer will be very useful in the future.