Apache Spark DataFrame to Dataset conversion (Scala)
I am trying to unpack Kafka message values into case class instances. (I put the messages in on the other side.) This code:
import ss.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Encoder, Encoders}

val enc: Encoder[TextRecord] = Encoders.product[TextRecord]

ss.udf.register("deserialize", (bytes: Array[Byte]) => {
  DefSer.deserialize(bytes).asInstanceOf[TextRecord]
})
val inputStream = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()
inputStream.printSchema

val records = inputStream
  .selectExpr("deserialize(value) AS record")
records.printSchema

val rec2 = records.as(enc)
rec2.printSchema
produces this output:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
root
|-- record: struct (nullable = true)
| |-- eventTime: timestamp (nullable = true)
| |-- lineLength: integer (nullable = false)
| |-- windDirection: float (nullable = false)
| |-- windSpeed: float (nullable = false)
| |-- gustSpeed: float (nullable = false)
| |-- waveHeight: float (nullable = false)
| |-- dominantWavePeriod: float (nullable = false)
| |-- averageWavePeriod: float (nullable = false)
| |-- mWaveDirection: float (nullable = false)
| |-- seaLevelPressure: float (nullable = false)
| |-- airTemp: float (nullable = false)
| |-- waterSurfaceTemp: float (nullable = false)
| |-- dewPointTemp: float (nullable = false)
| |-- visibility: float (nullable = false)
| |-- pressureTendency: float (nullable = false)
| |-- tide: float (nullable = false)
When I get to the sink:
val debugOut = rec2.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

debugOut.awaitTermination()
Catalyst complains:
Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`eventTime`' given input columns: [record];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
I have tried a number of ways to "lift the TextRecord up", such as calling rec2.map(r => r.getAs[TextRecord](0)) or explode("record"), but I keep running into ClassCastExceptions.
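A note on why those attempts fail (my reading, not something stated in the thread): the UDF result is stored as a struct column, so each value inside rec2 is a generic Row rather than a TextRecord, and the encoder behind .as(enc) looks for top-level columns named after the case class fields, which matches the "cannot resolve eventTime" error. One hedged way to lift the record is to expand the struct's fields and re-bind them:

import ss.implicits._

// Sketch, untested against this exact stream: expand the struct into
// top-level columns, then bind them back to the case class.
val rec2 = records
  .select($"record.*") // eventTime, lineLength, windDirection, ...
  .as[TextRecord]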
The simplest approach is to map the inputStream Row instances directly to TextRecord (given that it is a case class) using the map function:
import ss.implicits._

val inputStream = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .load()

val records = inputStream.map(row =>
  DefSer.deserialize(row.getAs[Array[Byte]]("value")).asInstanceOf[TextRecord]
)
records will then directly be a Dataset[TextRecord].
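For completeness, a minimal sketch applying the question's console sink to the resulting Dataset (same options as above, nothing new assumed):

// records is now a Dataset[TextRecord]; the console sink works unchanged.
val debugOut = records.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

debugOut.awaitTermination()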
Also, as long as you import the SparkSession implicits, you don't need to provide an Encoder for your case class explicitly; Scala will derive it for you implicitly.

Hi @jasonerothin, we missed you in Crested Butte this year. Were you able to do something like
selectExpr("deserialize(value).*")
? Hi @JackLeow, sadly no. Next year, for sure! Caused by: org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '*' expecting {'SELECT', 'FROM'...
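That ParseException suggests (my inference, the thread does not confirm it) that the SQL expression parser only accepts the star expansion on a named column, not directly on a function call. Aliasing the UDF result first and then expanding the named struct column, as in the earlier sketch, sidesteps the parser:

// Assumption: alias the UDF output, then expand the named struct column.
inputStream
  .selectExpr("deserialize(value) AS record")
  .select($"record.*")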