How do I read/write Protocol Buffers messages with Apache Spark?


apache-spark protocol-buffers


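I want to read and write Protocol Buffers messages from/to HDFS with Apache Spark. I have found these suggested approaches:

1) Convert the protobuf messages to JSON with Google's Gson library, then read/write them through Spark SQL. This approach has been explained elsewhere, but I think the conversion to JSON is an extra, unnecessary step.

2) Convert to Parquet files. There are GitHub projects that do this, but I do not want Parquet files: I always process all columns (not just some of them), so the Parquet format would not give me any benefit (at least I think so).

3) ScalaPB. Maybe this is what I am looking for, but it is in Scala, which I know nothing about. I am looking for a Java-based solution. There is an introduction to ScalaPB that explains how to use it (for Scala developers).

4) Use sequence files. This is what I am looking for, but I have not found anything about it. So my question is: how can I write protobuf messages to a sequence file on HDFS and read them back from it? Any other suggestion is also welcome.

5) Use Twitter's library.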

Although it is somewhat hidden among those points, you seem to be asking how to write a sequence file in Spark. I found an example:

// Importing org.apache.hadoop.io package
import org.apache.hadoop.io._

// We need data in sequence file format before we can read it, so write some first
// Read the input data from a text file
val dataRDD = sc.textFile("/public/retail_db/orders")

// Use NullWritable as the key; the value is a String, which is saved as Text
// Int and String do not need to be converted to IntWritable and Text by hand
// Other types may need an explicit conversion to a Writable object
// For example, if the key/value is of type Long, we might have to
// wrap it as new LongWritable(value)
dataRDD.
  map(x => (NullWritable.get(), x)).
  saveAsSequenceFile("/user/`whoami`/orders_seq")
// Make sure to replace `whoami` with the appropriate OS user id

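// Save a sequence file with a key of type Int and a value of type String
// A separate output path (orders_seq_kv, a placeholder name) is used because the save fails if the directory already exists
dataRDD.
  map { x =>
    val fields = x.split(",")
    (fields(0).toInt, fields(1))
  }.
  saveAsSequenceFile("/user/`whoami`/orders_seq_kv")
// Make sure to replace `whoami` with the appropriate OS user id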
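
The example above writes plain text, not protobuf. A rough sketch of how the same idea might be applied to protobuf messages is to serialize each message with `toByteArray` and store the bytes as a `BytesWritable` value, then parse them again with `parseFrom` when reading. This is only an illustration, not a tested recipe: it assumes protobuf-java 3.x is on the classpath, it uses the built-in `com.google.protobuf.StringValue` message as a stand-in for your own generated class, and the output path and `local[*]` master are placeholders.

```scala
import com.google.protobuf.StringValue
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.sql.SparkSession

object ProtobufSequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("protobuf-seqfile-sketch")
      .master("local[*]") // placeholder: local run for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Placeholder output directory; it must not exist before the job runs
    val path = "/tmp/protobuf_seq_sketch"

    // Write: build one message per record and store its serialized bytes
    sc.parallelize(Seq("a", "b", "c"))
      .map { s =>
        // StringValue stands in for your own generated protobuf class
        val msg = StringValue.newBuilder().setValue(s).build()
        (NullWritable.get(), new BytesWritable(msg.toByteArray))
      }
      .saveAsSequenceFile(path)

    // Read: copyBytes() returns exactly the valid bytes, whereas
    // getBytes() may return a padded backing array
    val values = sc
      .sequenceFile(path, classOf[NullWritable], classOf[BytesWritable])
      .map { case (_, bytes) => StringValue.parseFrom(bytes.copyBytes()).getValue }
      .collect()

    values.foreach(println)
    spark.stop()
  }
}
```

The messages are parsed inside the `map` right after reading, so only plain strings are collected to the driver. Hadoop `Writable` objects are not Java-serializable, so it is safer to convert them before any shuffle or `collect`. The same calls (`toByteArray`, `parseFrom`, `BytesWritable`) are available from Java, which is what the question asks for.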