Scala: How can I write from Spark to Kafka with a changed schema without getting an exception?

Tags: scala, apache-spark, apache-kafka, parquet, databricks

I am loading Parquet files from Databricks into Spark:

val dataset = context.session.read().parquet(parquetPath)
Then I perform a transformation like this:

val df = dataset.withColumn(
            columnName, concat_ws("",
            col(data.columnName), lit(textToAppend)))
When I try to save it as JSON to Kafka (and not back to Parquet!), I get the following exception:

org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file dbfs:/mnt/warehouse/part-00001-tid-4198727867000085490-1e0230e7-7ebc-4e79-9985-0a131bdabee2-4-c000.snappy.parquet. Column: [item_group_id], Expected: StringType, Found: INT32
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:310)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
    at com.databricks.sql.io.parquet.NativeColumnReader.readBatch(NativeColumnReader.java:448)
    at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:330)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:167)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:299)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:287)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
This only happens when trying to read multiple partitions. For example, in the /mnt/warehouse/ directory I have a lot of Parquet files, each representing the data for one datestamp. If I read only one of them I don't get the exception, but if I read the whole directory this exception occurs.

I get this when I do a transformation, like the one above that changes the data type of a column. How can I fix this? I am not trying to write back to Parquet, but to transform all the files from the same source schema into a new schema and write them to Kafka.

You can find instructions for this in Spark's documentation on Kafka integration; it shows the different ways of writing data to a Kafka topic.
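For reference, a minimal sketch of such a batch write, assuming the spark-sql-kafka package is on the classpath and using the transformed DataFrame from the question (df); the broker address and topic name are placeholders:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Serialize each row to a JSON string in the mandatory "value" column
val kafkaReady = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

kafkaReady.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker address
  .option("topic", "my-topic")                       // placeholder topic name
  .save()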

There seems to be a problem with the Parquet files themselves. The item_group_id column is not stored with the same data type in every file: some files store it as a String, others as an Integer. From the source code of the exception, SchemaColumnConvertNotSupportedException, we can see the following description:

Exception thrown when the Parquet reader finds column type mismatches.

A simple way of reproducing the problem can be found in Spark's own tests:

Seq(("bcd", 2)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet(s"$path/parquet")
Seq((1, "abc")).toDF("a", "b").coalesce(1).write.mode("append").parquet(s"$path/parquet")

spark.read.parquet(s"$path/parquet").collect()

Of course, this only happens when multiple files are read at once, or, as in the test above, when more data with a different type has been appended. If a single file is read, there is no mismatch between the column's data types.


The easiest way to solve this is to make sure the column types are correct for all files when the files are written.
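As a rough sketch of that idea (assuming the writing job has its source data in a variable called sourceDf, which is not shown in the question), the offending column could be cast to a single type before every write:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Force item_group_id to StringType so every Parquet file stores it the same way
sourceDf
  .withColumn("item_group_id", col("item_group_id").cast(StringType))
  .write
  .mode("append")
  .parquet("/mnt/warehouse/") // same target directory as in the question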

The alternative is to read all the Parquet files separately, adjust their schemas to match, and then combine them with union. An easy way to do that is to adjust the schema:

// Specify the files and read as separate dataframes
val files = Seq(...)
val dfs = files.map(file => spark.read.parquet(file))

// Specify the schema (here the schema of the first file is used)
val schema = dfs.head.schema

// Create new columns with the correct names and types
val newCols = schema.map(c => col(c.name).cast(c.dataType))

// Select the new columns and merge the dataframes
val df = dfs.map(_.select(newCols: _*)).reduce(_ union _)

If you want to write to a Kafka topic, one solution could be to use a producer directly.

Great, thank you, that was the problem: the source files contained data with the wrong type. Thanks for pointing that out!
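For the producer approach mentioned above, a rough sketch (broker address and topic name are placeholders) could send each row's JSON from the executors using the plain Kafka client library:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Open one producer per partition and push every row as a JSON string
df.toJSON.foreachPartition { rows: Iterator[String] =>
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092") // placeholder broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  try {
    rows.foreach(json => producer.send(new ProducerRecord[String, String]("my-topic", json)))
  } finally {
    producer.close()
  }
}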
Seq(("bcd", 2)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet(s"$path/parquet")
Seq((1, "abc")).toDF("a", "b").coalesce(1).write.mode("append").parquet(s"$path/parquet")

spark.read.parquet(s"$path/parquet").collect()