
Scala: Spark, Avro and Parquet


I have a stream of data in Avro format (JSON encoded) that I need to store as Parquet files. All I can do is

val df = sqc.read.json(jsonRDD).toDF()
and write the df out as Parquet.
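That write step might look like this (a sketch; the output path is hypothetical):

```scala
// Write the inferred-schema DataFrame out as Parquet files
df.write.parquet("/tmp/parquet-out") // hypothetical path
```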

Here the schema is inferred from the JSON. But I already have the avsc file, and I don't want Spark to infer the schema from the JSON.

With the approach above, the Parquet file stores the schema information as a StructType, not as avro.record.type. Is there also a way to store the Avro schema information?


SPARK-1.4.1

You can specify the schema programmatically:

// The schema is encoded in a string
val schemaString = "name age"

// Import Row and the Spark SQL data types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
See the Spark SQL programming guide.
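Applied to the question's setup, the same explicit schema can be handed to the JSON reader, so Spark skips inference entirely (a sketch; `jsonRDD` is the RDD from the question):

```scala
// Supplying a schema up front prevents Spark from sampling the JSON to infer one
val df = sqc.read.schema(schema).json(jsonRDD)
df.write.parquet("/tmp/output.parquet") // hypothetical path
```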

spark-avro then maps Spark SQL types to Avro types as follows:

  • Spark SQL type -> Avro type
  • ByteType -> int
  • ShortType -> int
  • DecimalType -> string
  • BinaryType -> bytes
  • TimestampType -> long
  • StructType -> record
You can write Avro records as follows:

import com.databricks.spark.avro._

val sqlContext = new SQLContext(sc)

import sqlContext.implicits._

val df = Seq((2012, 8, "Batman", 9.8),
        (2012, 8, "Hero", 8.7),
        (2012, 7, "Robot", 5.5),
        (2011, 7, "Git", 2.0))
        .toDF("year", "month", "title", "rating")

df.write.partitionBy("year", "month").avro("/tmp/output")
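The partitioned output can be read back with the same API; the partition columns reappear in the DataFrame (a sketch):

```scala
// Read the Avro files back; the year/month partition keys come back as columns
val readBack = sqlContext.read.avro("/tmp/output")
readBack.printSchema()
```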

In the end I used the answer to this question:

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType

def getSparkSchemaForAvro(sqc: SQLContext, avroSchema: Schema): StructType = {
  // Write an empty Avro file that carries only the schema, then let
  // spark-avro read it back to obtain the equivalent StructType
  val dummyFile = File.createTempFile("avro_dummy", "avro")
  val datumWriter = new GenericDatumWriter[GenericRecord](avroSchema)
  val writer = new DataFileWriter(datumWriter).create(avroSchema, dummyFile)
  writer.flush()
  writer.close()
  val df = sqc.read.format("com.databricks.spark.avro").load(dummyFile.getAbsolutePath)
  df.schema
}
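Putting it together with the question's setup (a sketch; the avsc path and output path are hypothetical):

```scala
import java.io.File
import org.apache.avro.Schema

// Parse the existing avsc, convert it to a Spark StructType, and use
// that instead of letting Spark infer a schema from the JSON
val avroSchema = new Schema.Parser().parse(new File("user.avsc")) // hypothetical path
val sparkSchema = getSparkSchemaForAvro(sqc, avroSchema)
val df = sqc.read.schema(sparkSchema).json(jsonRDD)
df.write.parquet("/tmp/parquet-out") // hypothetical path
```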