
Scala: how to programmatically create a custom org.apache.spark.sql.types.StructType schema object starting from a JSON file


I have to create a custom org.apache.spark.sql.types.StructType schema object using the information in a JSON file. The JSON file can be anything, so I have parameterized its location in a properties file.

This is what it looks like in the properties file:

//path to the output file schema (by default the schema is inferred from the target Parquet). If present, the schema must be in JSON format, applicable to a DataFrame (see StructType.fromJson)
schema.parquet=/Users/XXXX/Desktop/generated_schema.json
writing.mode=overwrite
separator=;
header=false
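
For context, these keys would typically be read with java.util.Properties before the job starts; a minimal sketch, with a hypothetical path for the properties file:

import java.io.FileInputStream
import java.util.Properties

// load the job parameters from the properties file (the path is an assumption)
val props = new Properties()
props.load(new FileInputStream("/Users/XXXX/Desktop/parametrizacion.properties"))
val mra_schema_parquet = props.getProperty("schema.parquet")
val separator = props.getProperty("separator")
val header = props.getProperty("header")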
The generated_schema.json file looks like this:

{"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
So, this is how I thought I could solve it:

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.sql.types.{DataType, StructType}

val path: Path = new Path(mra_schema_parquet)
val fileSystem = path.getFileSystem(sc.hadoopConfiguration)
val inputStream: FSDataInputStream = fileSystem.open(path)
// lazily read the file line by line; head is the first (and only) line
val schema_json = Stream.cons(inputStream.readLine(), Stream.continually(inputStream.readLine))

System.out.println("schema_json looks like "  + schema_json.head)

val mySchemaStructType: DataType = DataType.fromJson(schema_json.head)

/*
After this line, mySchemaStructType has four StructField objects inside it, the same ones that appear in schema_json
*/
logger.info(mySchemaStructType)

val myStructType = new StructType()
myStructType.add("mySchemaStructType",mySchemaStructType)

/*

After this line, myStructType has zero StructFields! The bug must be here: myStructType should contain the four StructFields that represent the loaded schema JSON! But how can I construct the necessary StructType object?

*/

myDF = loadCSV(sqlContext, path_input_csv,separator,myStructType,header)
System.out.println("myDF.schema.json looks like " + myDF.schema.json)
inputStream.close()
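
As background: StructType is immutable, so add returns a new StructType instead of mutating the receiver, which matches the zero-fields observation above; also, adding the parsed struct as one named field would nest it a level deep rather than expose its four fields. A minimal sketch of the add behaviour:

import org.apache.spark.sql.types.{StringType, StructType}

val s0 = new StructType()
val s1 = s0.add("codigo", StringType) // add returns a NEW StructType
println(s0.fields.length)             // 0 -- s0 is unchanged
println(s1.fields.length)             // 1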

myDF.write
  .format("com.databricks.spark.csv")
  .option("header", header)
  .option("delimiter", delimiter)
  .option("nullValue", "")
  .option("treatEmptyValuesAsNulls", "true")
  .mode(saveMode)
  .parquet(pathParquet) // .parquet(...) switches the format to "parquet", overriding the spark-csv format/options above
When the code runs the last line, .parquet(pathParquet), this exception is thrown:

parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root {
}
The output of this code looks like this:

16/11/11 13:57:04 INFO AnotherCSVtoParquet$: The job started using this propertie file: /Users/aisidoro/Desktop/mra-csv-converter/parametrizacion.properties
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_input_csv is /Users/aisidoro/Desktop/mra-csv-converter/cds_glcs.csv
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: path_output_parquet  is /Users/aisidoro/Desktop/output900000
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: mra_schema_parquet is /Users/aisidoro/Desktop/mra-csv-converter/generated_schema.json
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: writting_mode is overwrite
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: separator is ;
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: header is false
16/11/11 13:57:05 INFO AnotherCSVtoParquet$: ATTENTION! aplying mra_schema_parquet  /Users/aisidoro/Desktop/mra-csv-converter/generated_schema.json
schema_json looks like {"type" : "struct","fields" : [ {"name" : "codigo","type" : "string","nullable" : true}, {"name":"otro", "type":"string", "nullable":true}, {"name":"vacio", "type":"string", "nullable":true},{"name":"final","type":"string","nullable":true} ]}
16/11/11 13:57:12 INFO AnotherCSVtoParquet$: StructType(StructField(codigo,StringType,true), StructField(otro,StringType,true), StructField(vacio,StringType,true), StructField(final,StringType,true))
 16/11/11 13:57:13 INFO AnotherCSVtoParquet$: loadCSV. header is false, inferSchema is false pathCSV is /Users/aisidoro/Desktop/mra-csv-converter/cds_glcs.csv separator is ;
 myDF.schema.json looks like {"type":"struct","fields":[]}
The schema_json object and the myDF.schema.json object should have the same content, shouldn't they? But they don't: myDF's schema has no fields, and I think that is what triggers the error.

Finally, the job crashes with this exception:

parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message root {
}
In fact, if I don't supply any JSON schema file the job runs fine, but with this schema it fails.

Can anyone help me? I just want to create some Parquet files starting from a CSV file and a JSON schema file.

Thank you very much.

The dependencies are:

    <spark.version>1.5.0-cdh5.5.2</spark.version>
    <databricks.version>1.5.0</databricks.version>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>${spark.version}</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>${databricks.version}</version>
    </dependency>

Since you said custom schema, you can do it like this:

import org.apache.spark.sql.types.{StringType, StructType}

val schema = (new StructType).add("field1", StringType).add("field2", StringType)
sqlContext.read.schema(schema).json("/json/file/path").show
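
To see this end to end without a file, the same schema can be applied to an in-memory RDD of JSON strings (the sample record is hypothetical); a minimal sketch:

// json(RDD[String]) avoids needing a file on disk for the demo
val rdd = sc.parallelize(Seq("""{"field1":"a","field2":"b"}"""))
sqlContext.read.schema(schema).json(rdd).show()
// +------+------+
// |field1|field2|
// +------+------+
// |     a|     b|
// +------+------+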
It is also worth investigating and analyzing this further.

You can create a nested JSON schema like the one below.

For example:

{
  "field1": {
    "field2": {
      "field3": "create",
      "field4": 1452121277
    }
  }
}

import org.apache.spark.sql.types.{LongType, StringType, StructType}

val schema = (new StructType)
  .add("field1", (new StructType)
    .add("field2", (new StructType)
      .add("field3", StringType)
      .add("field4", LongType)
    )
  )
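
A quick usage sketch under the same assumptions (the file path is hypothetical):

val df = sqlContext.read.schema(schema).json("/json/file/path")
df.printSchema()
df.select("field1.field2.field3").show() // nested fields are addressed with dot notation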

Finally, I found the problem.

The problem was in these lines:

val myStructType = new StructType()
myStructType.add("mySchemaStructType",mySchemaStructType)
I had to use this line instead:

val mySchemaStructType = DataType.fromJson(schema_json.head).asInstanceOf[StructType]

I had to cast the parsed DataType to StructType to make it work.
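
Putting the fix together, a minimal end-to-end sketch (variable names come from the question; the CSV is read here directly through spark-csv rather than the question's loadCSV helper):

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.types.{DataType, StructType}

// read the whole schema file and parse it straight into a StructType
val path = new Path(mra_schema_parquet)
val fs = path.getFileSystem(sc.hadoopConfiguration)
val in = fs.open(path)
val schemaJson = scala.io.Source.fromInputStream(in).mkString
in.close()

val mySchemaStructType = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// spark-csv honors a user-supplied schema passed via .schema(...)
val myDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", header)
  .option("delimiter", separator)
  .schema(mySchemaStructType)
  .load(path_input_csv)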

Thanks Shankar. With this approach I have to read a separate file with the schema in order to create it, but what if the structure is not flat?

@aironman: I updated my answer; you can create a nested JSON schema like the one provided in the link.

Thanks Shankar, I really appreciate it, but the nested JSON schema file may vary over time, so I don't know how it will evolve; the schema is dynamic. Is there a better way to create this schema object while parsing the JSON schema file?

@aironman: I haven't tried reading nested JSON without a schema, but it should work. I mean

val df = sqlContext.read.json("/json/file/path")

you don't need to pass a schema and it still returns a DataFrame.
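
For completeness, a sketch of that inference on the nested example above (the path is hypothetical; the printed tree follows from the sample document):

val df = sqlContext.read.json("/json/file/path")
df.printSchema()
// root
//  |-- field1: struct (nullable = true)
//  |    |-- field2: struct (nullable = true)
//  |    |    |-- field3: string (nullable = true)
//  |    |    |-- field4: long (nullable = true)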