How do I serialize a Java object in Hadoop?


An object should implement the Writable interface so it can be serialized when it is shipped around in Hadoop. Take the Lucene ScoreDoc class as an example:

public class ScoreDoc implements java.io.Serializable {

  /** The score of this document for the query. */
  public float score;

  /** Expert: A hit document's number.
   * @see Searcher#doc(int) */
  public int doc;

  /** Only set by {@link TopDocs#merge} */
  public int shardIndex;

  /** Constructs a ScoreDoc. */
  public ScoreDoc(int doc, float score) {
    this(doc, score, -1);
  }

  /** Constructs a ScoreDoc. */
  public ScoreDoc(int doc, float score, int shardIndex) {
    this.doc = doc;
    this.score = score;
    this.shardIndex = shardIndex;
  }

  // A convenience method for debugging.
  @Override
  public String toString() {
    return "doc=" + doc + " score=" + score + " shardIndex=" + shardIndex;
  }
}
How should I serialize it with the Writable interface? And what is the connection between the Writable and java.io.Serializable interfaces?

First of all: you can use plain Java serialization, or you can write your own Writable.

You need to implement your own write and readFields methods. It is very simple, because inside them you can call the DataOutput/DataInput API to read and write int, float, String, and so on.

Your example as a Writable (imports needed):
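Here is a minimal sketch of what that could look like, assuming you are free to modify the class itself (the Hadoop Writable import is the only dependency beyond java.io):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ScoreDoc implements Writable, java.io.Serializable {

  public float score;
  public int doc;
  public int shardIndex;

  // Hadoop instantiates Writables reflectively, so a no-arg constructor is needed
  public ScoreDoc() {}

  public ScoreDoc(int doc, float score, int shardIndex) {
    this.doc = doc;
    this.score = score;
    this.shardIndex = shardIndex;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeFloat(score);
    out.writeInt(doc);
    out.writeInt(shardIndex);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // must mirror write() exactly: same fields, same order, same types
    score = in.readFloat();
    doc = in.readInt();
    shardIndex = in.readInt();
  }
}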


Note: the order of the writes and the reads must be identical; otherwise one value will end up in another field, and if the types differ you will get a serialization error when reading.

I don't think tampering with a built-in Lucene class is a good idea. Instead, you can have your own class that holds a field of type ScoreDoc and implements Hadoop's Writable interface. It would be something like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.lucene.search.ScoreDoc;

public class MyScoreDoc implements Writable {

  private ScoreDoc sd;

  // Hadoop instantiates Writables reflectively, so a no-arg constructor is required
  public MyScoreDoc() {}

  public MyScoreDoc(ScoreDoc sd) {
    this.sd = sd;
  }

  public void write(DataOutput out) throws IOException {
      // ScoreDoc's fields are public, so read them directly
      // (no need to parse them back out of toString())
      out.writeFloat(sd.score);
      out.writeInt(sd.doc);
      out.writeInt(sd.shardIndex);
  }

  public void readFields(DataInput in) throws IOException {
      // read back in exactly the order write() wrote
      float score = in.readFloat();
      int doc = in.readInt();
      int shardIndex = in.readInt();

      sd = new ScoreDoc(doc, score, shardIndex);
  }

  //String toString()
}
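For what it's worth, to use such a wrapper between mapper and reducer you would register it as the map output value class. A hypothetical snippet (the job name and variable names are illustrative, not from the original):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// hypothetical job wiring; assumes MyScoreDoc has a no-arg constructor
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "score-docs");
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(MyScoreDoc.class);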

When Hadoop passes values between the mapper and the reducer, what is the internal difference between the two approaches?

@Denzel: In short, the main difference is that one approach works and the other does not, because Hadoop relies on the Writable interface for wire serialization :) It uses Writable to send data over the network in an optimal way... Why do you need to transfer ScoreDoc instances in Hadoop directly, rather than wrapping them as one of the answers suggests? Could you give more details about your use case?
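To make the wire-format point above concrete, here is a hedged round-trip sketch using Hadoop's in-memory buffers; it assumes the MyScoreDoc wrapper from the answer above:

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;

// serialize: one float + two ints = 12 bytes on the wire
DataOutputBuffer out = new DataOutputBuffer();
new MyScoreDoc(new ScoreDoc(1, 0.9f, -1)).write(out);

// deserialize into a fresh instance, much as Hadoop does between tasks
DataInputBuffer in = new DataInputBuffer();
in.reset(out.getData(), out.getLength());
MyScoreDoc copy = new MyScoreDoc();
copy.readFields(in);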