Spark DataFrame: how to specify the schema when writing as Avro
I want to write a DataFrame in Avro format using a provided Avro schema rather than Spark's auto-generated schema. How can I tell Spark to use my custom schema on write?

After applying the patch, I was able to specify the schema on write as follows:
df.write.option("forceSchema", myCustomSchemaString).avro("/path/to/outputDir")
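For illustration, `myCustomSchemaString` could hold an Avro schema in its standard JSON form. The sketch below is an assumption, not the asker's actual schema: the record name `TitleRating`, the namespace `com.example.movies`, and the three fields are all hypothetical.

```scala
// Hypothetical Avro record schema; name, namespace and fields are illustrative only.
val myCustomSchemaString: String =
  """{
    |  "type": "record",
    |  "name": "TitleRating",
    |  "namespace": "com.example.movies",
    |  "fields": [
    |    {"name": "titleId", "type": ["null", "string"], "default": null},
    |    {"name": "averageRating", "type": "float"},
    |    {"name": "numVotes", "type": "int"}
    |  ]
    |}""".stripMargin
```

The nullable field is declared as a union with `null`, which is how Avro expresses optional values.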
Hope the approach below helps.
import com.databricks.spark.avro._  // provides .avro(...) on the DataFrame writer
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("title", StringType, true) ::
  StructField("averageRating", DoubleType, false) ::
  StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
Example:
Download the data from https://datasets.imdbws.com/
Download the movie ratings file, title.ratings.tsv.gz.
Copy it to /home/cloudera/workspace/movies/title.ratings.tsv.gz
Start spark-shell and run the commands below.
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val title = sqlContext.read.text("file:///home/cloudera/workspace/movies/title.ratings.tsv.gz")
scala> title.limit(5).show
+--------------------+
|               value|
+--------------------+
|tconst averageRat...|
|  tt0000001 5.8 1350|
|   tt0000002 6.5 157|
|   tt0000003 6.6 933|
|    tt0000004 6.4 93|
+--------------------+
val titlerdd = title.rdd
case class Title(titleId:String, averageRating:Float, numVotes:Int)
val titlefirst = titlerdd.first
val titleMapped = titlerdd.filter(e => e != titlefirst).map(e => {
  val rowStr = e.getString(0)
  val splitted = rowStr.split("\t")
  val titleId = splitted(0).trim
  val averageRating = scala.util.Try(splitted(1).trim.toFloat).getOrElse(0.0f)
  val numVotes = scala.util.Try(splitted(2).trim.toInt).getOrElse(0)
  Title(titleId, averageRating, numVotes)
})
val titleMappedDF = titleMapped.toDF
scala> titleMappedDF.limit(2).show
+---------+-------------+--------+
|  titleId|averageRating|numVotes|
+---------+-------------+--------+
|tt0000001|          5.8|    1350|
|tt0000002|          6.5|     157|
+---------+-------------+--------+
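The per-line parsing used in the mapping step above can be tried without a Spark cluster. This is a minimal sketch in plain Scala; the helper name `parseTitle` is hypothetical, and it mirrors the fallback behaviour of the original snippet (0 for a missing or malformed numeric field).

```scala
import scala.util.Try

case class Title(titleId: String, averageRating: Float, numVotes: Int)

// Parse one tab-separated line; numeric fields fall back to 0 when missing or malformed.
def parseTitle(line: String): Title = {
  val fields = line.split("\t")
  val titleId = fields(0).trim
  val averageRating = Try(fields(1).trim.toFloat).getOrElse(0.0f)
  val numVotes = Try(fields(2).trim.toInt).getOrElse(0)
  Title(titleId, averageRating, numVotes)
}
```

Because the index lookups sit inside `Try`, a short line such as `"tt0000002\tN/A"` yields default values rather than an exception. Silently defaulting to 0 can hide bad input, so filtering out failed rows may be preferable for real data.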
import com.databricks.spark.avro._  // provides .avro(...) on the DataFrame writer
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("title", StringType, true) ::
  StructField("averageRating", DoubleType, false) ::
  StructField("numVotes", IntegerType, false) :: Nil)
titleMappedDF.write.option("avroSchema", schema.toString).avro("/home/cloudera/workspace/movies/avrowithschema")
The "avroSchema" option is not valid for .write. It is only used for .read, by DefaultSource.
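If the option is indeed ignored on write, one possible workaround is to reshape the DataFrame so that the schema Spark derives automatically already has the intended column names and types. The sketch below is an assumption rather than a confirmed fix; the helper name `alignToTargetSchema` is hypothetical, and the target columns are taken from the `StructType` in the answer above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Rename and cast columns so the schema derived on write matches the intended one.
// Assumes the input has columns titleId, averageRating and numVotes.
def alignToTargetSchema(df: DataFrame): DataFrame =
  df.select(
    col("titleId").as("title"),
    col("averageRating").cast(DoubleType).as("averageRating"),
    col("numVotes")
  )
```

The result could then be written with `alignToTargetSchema(titleMappedDF).write.avro(...)`. This only controls column names and types; the Avro record name, namespace, and nullability are still chosen by Spark.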