JSON Spark Streaming writeStream issue
I am trying to create a dynamic schema from the JSON records in a text file, since each record has a different schema. Below is my code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json, col}

object streamingexample {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    val df1 = spark.readStream.textFile("C:\\Users\\sheol\\Desktop\\streaming")
    val newdf11 = df1
    val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
    val df2 = df1.select(from_json($"value", schema_of_json(json_schema)).alias("value_new"))
    val df3 = df2.select($"value_new.*")
    df3.printSchema()
    df3.writeStream
      .option("truncate", "false")
      .format("console")
      .start()
      .awaitTermination()
  }
}
I get the following error. Please help me figure out how to fix this code; I have tried many things and cannot understand what is wrong.
Error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Sample data:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
This exception occurs because you are trying to access data from the stream before the stream has started. Structured Streaming does not allow actions like collect() or printSchema() on a streaming DataFrame before writeStream.start() is called. As you already know, the statement below is the one that causes the problem in your code:
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
You can obtain the JSON schema in a different way, like this:
val dd: DataFrame = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
dd.show()
/** you can use val df1 = spark.readStream.textFile(yourfile) also **/
val json_schema = dd.schema.json;
println(json_schema)
Result:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}
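The answer stops at printing the schema JSON string, but that string can be turned back into a StructType with DataType.fromJson, which both from_json and readStream.schema(...) accept. A minimal sketch (the schema literal below is the one printed above):

```scala
import org.apache.spark.sql.types.{DataType, StructType}

object SchemaRoundTrip {
  def main(args: Array[String]): Unit = {
    // Schema JSON produced by dd.schema.json above.
    val schemaJson =
      """{"type":"struct","fields":[
        |{"name":"age","type":"long","nullable":true,"metadata":{}},
        |{"name":"name","type":"string","nullable":true,"metadata":{}}]}""".stripMargin

    // DataType.fromJson parses the string back into a Spark DataType;
    // for a struct schema this cast to StructType is safe.
    val parsedSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
    println(parsedSchema.fieldNames.mkString(", "))  // age, name
  }
}
```

This is how a schema can be persisted (e.g. to a file) once and reused across streaming runs without touching the stream's data.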
You can refine this further to fit your needs; I will leave that to you.

Comments on the answer:

"val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",") — this is the line of code that triggers the error. I hope your problem is resolved."

"We are not allowed to access the data before the stream starts; I understand that now. However, is there a better way to approach this? I would need a StructType, or a string of the incoming JSON data, to use here whenever the schema is updated. In a previous project I did this to convert to a StructType: val staticInputDS = spark.readStream.option("header", "true").schema(Util.schema).csv("hdfs://master:9000/user/hadoop/test*.csv")"
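Putting the pieces together, a sketch (untested, using the paths from the question) of the approach the answer suggests: infer the schema once with a static batch read, then pass that StructType to from_json in the streaming query, so no collect() ever runs on a streaming source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

object streamingFixed {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExamples")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._

    // 1. Batch read to infer the schema; legal because this is not a stream.
    val staticDF = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
    val inferredSchema = staticDF.schema  // a StructType

    // 2. Streaming read, parsing each line with the pre-inferred schema.
    val streamDF = spark.readStream
      .textFile("C:\\Users\\sheol\\Desktop\\streaming")
      .select(from_json($"value", inferredSchema).alias("value_new"))
      .select($"value_new.*")

    // printSchema() is fine here: it inspects metadata, not the data itself.
    streamDF.printSchema()

    streamDF.writeStream
      .option("truncate", "false")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```

Note the trade-off: the schema is fixed at startup, so records with fields outside the inferred schema will have those fields dropped; truly per-record schemas are not something Structured Streaming supports directly.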