Scala: getting a Spark Column from a Spark Row
I'm new to both Scala and Spark, so I'm struggling to create a map function over a DataFrame Row(). I've been loosely following this article.
The from_avro function wants to accept a Column(), but I don't see a way in the docs to get a Column from a Row.
I'm fully open to the idea that I may be doing this whole thing wrong.
Ultimately, my goal is to parse the records coming in from the stream.
Successfully parsed records get written to Delta table A, and failed records get written to another Delta table B.
For context, the source table looks like this:
EDIT — from_avro returning null on "bad records"

There have been some comments saying that from_avro returns null if it fails to parse a "bad record". By default, from_avro uses the mode FAILFAST, which throws an exception when parsing fails. If the mode is set to PERMISSIVE, an object in the shape of the schema is returned, but with all of its attributes set to null (which is not especially useful either...).
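As a minimal sketch of how that parse mode is passed (the fromAvroOptions variable in the command below presumably plays exactly this role; everything other than the "mode" option and its FAILFAST/PERMISSIVE values is a placeholder, not code from the original post):

import org.apache.spark.sql.avro.functions.from_avro
import scala.collection.JavaConverters._

// Sketch only: hand the parse mode to from_avro via its options map.
// PERMISSIVE returns an all-null struct for bad records instead of failing the job.
val fromAvroOptions = Map("mode" -> "PERMISSIVE").asJava
val parsedValueCol =
  from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as("parsedValue")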
Here is my original command:
val parsedDf = filterValueDF.select(
  $"topic",
  $"partition",
  $"offset",
  $"timestamp",
  $"timestampType",
  $"valueSchemaId",
  from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))
If any bad rows are present, the job fails with org.apache.spark.SparkException: Job aborted.
A snippet of the exception log:
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
... 10 more
Suppressed: java.lang.NullPointerException
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
at org.apache.parquet.format.Util.write(Util.java:222)
at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
... 11 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
... 16 more
As far as I understand, you just need to fetch one column for a single row. You can probably achieve that by using row.get() at a specific index to get the column value. To get a specific column from a Row object, you can either use row.get(i) or use the column name with row.getAs[T]("columnName"). You can check the Row class for the details.
Your code would then look like this:
import org.apache.spark.sql.Row
import scala.util.{Failure, Success, Try}

val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
  val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
  val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
    case Success(parsedValue) => List(parsedValue, null)
    case Failure(ex)          => List(null, ex.toString)
  }
  Row.fromSeq(row.toSeq.toList ++ parsed)
}
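If you did go down this RDD route, the resulting RDD[Row] still needs an explicit schema before it can become a DataFrame again. A minimal sketch, assuming the two appended columns are modelled as plain strings (the names parsedValue and parseError are placeholders, not from the original thread):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Extend the source schema with the two columns appended in the map above.
val extendedSchema = StructType(
  filterValueDF.schema.fields ++ Seq(
    StructField("parsedValue", StringType, nullable = true),
    StructField("parseError", StringType, nullable = true)))

val rddBackedDf = spark.createDataFrame(rddWithExceptionHandling, extendedSchema)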
In your case, though, you don't actually need to go into a map function at all. Inside map you have to work with basic Scala types, and that is exactly why you can't call from_avro directly from map: instances of the Column class can only be used in combination with the DataFrame API, i.e. df.select($"c1"), where c1 is an instance of Column. To use from_avro as you initially intended, simply write:

filterValueDF.select(from_avro($"fixedValue", currentValueSchema))
As @mike already mentioned, from_avro returns null if it fails to parse the Avro content. Finally, if you want to separate the successful rows from the failed ones, you could do something like this:
val includingFailuresDf = filterValueDF.select(
    from_avro($"fixedValue", currentValueSchema) as "avro_res")
  .withColumn("failed", $"avro_res".isNull)

val successDf = includingFailuresDf.where($"failed" === false)
val failedDf  = includingFailuresDf.where($"failed" === true)
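Since the stated goal is to land successes in Delta table A and failures in Delta table B, here is a hedged sketch of what that split could look like for a streaming DataFrame using foreachBatch. It assumes includingFailuresDf is the streaming DataFrame built above; the Delta paths and checkpoint location are placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: route each micro-batch to two Delta tables based on the "failed" flag.
val query = includingFailuresDf.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.where(col("failed") === false).write.format("delta").mode("append").save("/delta/tableA")
    batch.where(col("failed") === true).write.format("delta").mode("append").save("/delta/tableB")
  }
  .option("checkpointLocation", "/delta/checkpoints/avro-parse-split")
  .start()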
Note that the code is not tested.

I'm not sure whether I fully understand your use case, but I would try to stay within the DataFrame (not convert it to an RDD) and just apply from_avro to the fixedValue column with the given schema. If parsing does not work, the from_avro function should return null values. That means you can filter your DataFrame on those nulls and write them to Delta table B, while you send the other part of the filtered result to Delta table A.

@mike Your suggestion is what I'm currently doing. However, if from_avro hits a row it cannot parse, it does not return null; it fails the entire streaming job. See the updated answer. As far as I can tell, the behavior you reference applies when the mode is PERMISSIVE, not by default.

I know it's untested, but it's very close!! However, .isNull isn't quite right: you get back a struct with all of the attributes present, but they are all null, so $"failed" is always false. It looks like I almost need to write a Java/Scala function that extends from_avro to make it more actionable.

@Oliver You're right; my excuse is that it's been a couple of months since I used this particular feature and I'd forgotten some of the details. I'll update the answer accordingly. @oliver You are also right about filtering the failures. It seems some low-level validation of the Avro is needed. Maybe you can use some of the existing tooling from the Spark Avro library, where from_avro and to_avro live.
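On that last point about low-level validation, here is a possible sketch (not from the original thread) that uses the plain Avro reader classes shipped alongside the Spark Avro library inside a UDF, flagging payloads that cannot be decoded regardless of from_avro's mode. It assumes currentValueSchema.value holds the Avro schema as a JSON string; the parseable flag and function name are placeholders:

import org.apache.avro.Schema
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.functions.udf
import scala.util.Try

// Sketch only: try to decode the raw bytes with a GenericDatumReader and report success.
val schemaJson = currentValueSchema.value
val canParseAvro = udf { (bytes: Array[Byte]) =>
  Try {
    val schema  = new Schema.Parser().parse(schemaJson)
    val reader  = new GenericDatumReader[Any](schema)
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder)
  }.isSuccess
}

val flaggedDf = filterValueDF.withColumn("parseable", canParseAvro($"fixedValue"))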