Apache Spark: How to convert an RDD[JsonObject] to a Dataset
I am reading data from an RDD whose elements are of type com.google.gson.JsonObject. I am trying to convert it to a Dataset, but I don't know how to do it.
import com.google.gson.{JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.{SparkSession}

object tmp {
  class people(name: String, age: Long, phone: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val parser = new JsonParser();
    val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject()
    val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject()
    val PairRDD = sc.parallelize(List(
      (new LongWritable(1l), jsonObject1),
      (new LongWritable(2l), jsonObject2)
    ))
    val rdd1 = PairRDD.map(element => element._2)
    import spark.implicits._
    //How to create Dataset as schema People from rdd1?
  }
}
Even trying to print the elements of rdd1 throws:
object not serializable (class: org.apache.hadoop.io.LongWritable, value: 1)
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (1,{"name":"abc","age":23,"phone":"0208"}))
Basically, I get this RDD[(LongWritable, JsonObject)] from a table, and I want to convert it to a Dataset so that I can apply SQL transformations to it.
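I deliberately left phone null in the second record; BigQuery does not return anything for elements with null values.
Thanks for the clarification.
You need to register the classes with Kryo so that they can be serialized. Below is what worked for me. I was running in spark-shell, so I had to stop the old context and create a new SparkContext from a configuration that includes the registered Kryo classes.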
import com.google.gson.{JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.SparkContext

// spark-shell already owns a context, so stop it and reuse its configuration
sc.stop()
val conf = sc.getConf
// registerKryoClasses also switches spark.serializer to the KryoSerializer
conf.registerKryoClasses( Array(classOf[LongWritable], classOf[JsonParser] ))
conf.get("spark.kryo.classesToRegister")
val sc = new SparkContext(conf)

val parser = new JsonParser();
val jsonObject1 = parser.parse("""{"name":"abc","age":23,"phone":"0208"}""").getAsJsonObject()
val jsonObject2 = parser.parse("""{"name":"xyz","age":33}""").getAsJsonObject()
val pairRDD = sc.parallelize(List(
  (new LongWritable(1L), jsonObject1),
  (new LongWritable(2L), jsonObject2)
))
// Drop the LongWritable key and keep only the JsonObject values
val rdd = pairRDD.map(element => element._2)
rdd.collect()
// res9: Array[com.google.gson.JsonObject] = Array({"name":"abc","age":23,"phone":"0208"}, {"name":"xyz","age":33})

// Turn each JsonObject back into a JSON string and let Spark infer the schema
val jsonstrs = rdd.map(e=>e.toString).collect()
val df = spark.read.json( sc.parallelize(jsonstrs) )
df.printSchema
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- phone: string (nullable = true)
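The DataFrame above is untyped, while the question asks for a Dataset with the People schema. A minimal sketch of that last step follows, assuming Spark 2.2 or later (for `spark.read.json` on a `Dataset[String]`). The `People` case class and the `ToTypedDataset` object are hypothetical names introduced for illustration; `phone` is declared as `Option[String]` because the second record has no phone.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the inferred schema.
// phone is optional because some records do not carry it.
case class People(name: String, age: Long, phone: Option[String])

object ToTypedDataset {
  // Parses JSON strings and maps the inferred columns onto People by name.
  def toPeople(spark: SparkSession, json: Seq[String]): Dataset[People] = {
    import spark.implicits._
    val df = spark.read.json(json.toDS()) // schema inferred: age long, name string, phone string
    df.as[People]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("to-typed-dataset")
      .getOrCreate()

    val people = toPeople(spark, Seq(
      """{"name":"abc","age":23,"phone":"0208"}""",
      """{"name":"xyz","age":33}"""
    ))

    // Once registered as a view, the Dataset can be queried with SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE phone IS NULL").show()

    spark.stop()
  }
}
```

In spark-shell the same two steps reduce to defining the case class and calling `df.as[People]` on the DataFrame built in the answer.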
Thanks Shoaib, I have edited my question in case that gives you more ideas.
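A possible alternative to Kryo registration, sketched under one assumption: the LongWritable and JsonObject values are only ever created on the executors, as they would be when the RDD comes from a Hadoop input format rather than from `parallelize` on driver-side objects. In that case, mapping each JsonObject to its JSON string before any `collect` or shuffle means only plain strings need to be serialized. The `JsonRddWithoutKryo` object and `simulatedInput` helper are hypothetical and only stand in for the real table read.

```scala
import com.google.gson.{JsonObject, JsonParser}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonRddWithoutKryo {
  // Stand-in for the real input (assumption): the non-serializable objects are
  // built inside map, so they exist only on the executors.
  def simulatedInput(spark: SparkSession): RDD[(LongWritable, JsonObject)] =
    spark.sparkContext
      .parallelize(Seq(
        1L -> """{"name":"abc","age":23,"phone":"0208"}""",
        2L -> """{"name":"xyz","age":33}"""
      ))
      .map { case (key, json) =>
        (new LongWritable(key), new JsonParser().parse(json).getAsJsonObject)
      }

  // Converts each JsonObject to a String within the same stage, then lets
  // Spark infer the schema from the JSON strings.
  def toDataFrame(spark: SparkSession, input: RDD[(LongWritable, JsonObject)]): DataFrame = {
    import spark.implicits._
    val jsonStrings: RDD[String] = input.map { case (_, obj) => obj.toString }
    spark.read.json(spark.createDataset(jsonStrings))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-rdd-without-kryo")
      .getOrCreate()

    val df = toDataFrame(spark, simulatedInput(spark))
    df.printSchema()

    spark.stop()
  }
}
```

This also avoids collecting the strings to the driver and parallelizing them again, which matters once the table is larger than a toy example.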