
Scala: get a Spark Column from a Spark Row

Tags: scala, apache-spark, avro, spark-structured-streaming, delta-lake

I am new to both Scala and Spark, so I'm struggling to create a map function over a DataFrame's Rows. I have been loosely following this article.

The from_avro function expects a Column, but I don't see a way in the documentation to get a Column from a Row.

I fully accept that I may be going about this the wrong way. Ultimately, my goal is to parse the incoming Avro payload: successfully parsed records are written to Delta table A, and records that fail to parse are written to a separate Delta table B.

For context, the source table looks like this:

Edit - from_avro returns null on "bad records"

Some comments have suggested that from_avro returns null when it fails to parse a "bad record". By default, from_avro uses the FAILFAST mode, which throws an exception when parsing fails. If the mode is set to PERMISSIVE, an object in the shape of the schema is returned, but with all of its properties set to null (which isn't particularly useful either...).
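For reference, here is a minimal, untested sketch of switching that parse mode to PERMISSIVE through the options map; it assumes Spark 3.x's org.apache.spark.sql.avro.functions.from_avro, reuses filterValueDF and currentValueSchema from my code below, and the names permissiveOptions / parsedPermissiveDf are just placeholders for this example:

import scala.collection.JavaConverters._
import org.apache.spark.sql.avro.functions.from_avro

// "mode" -> "PERMISSIVE" makes failed parses yield a null-filled struct
// instead of throwing; the default is "FAILFAST" (the option named in the
// exception message further down).
val permissiveOptions = Map("mode" -> "PERMISSIVE").asJava

val parsedPermissiveDf = filterValueDF.select(
  from_avro($"fixedValue", currentValueSchema.value, permissiveOptions).as("parsedValue"))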

Here is my original command:

val parsedDf = filterValueDF.select($"topic", 
                                    $"partition", 
                                    $"offset", 
                                    $"timestamp", 
                                    $"timestampType", 
                                    $"valueSchemaId", 
                                    from_avro($"fixedValue", currentValueSchema.value, fromAvroOptions).as('parsedValue))
If any bad rows are present, the job fails with
org.apache.spark.SparkException: Job aborted.

A snippet of the exception log:

Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
    at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:111)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:732)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:291)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1615)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:300)
    ... 10 more
    Suppressed: java.lang.NullPointerException
        at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsOutputStream.write(NativeAzureFileSystem.java:1099)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
        at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
        at shaded.parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:107)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:252)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:234)
        at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
        at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1184)
        at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1051)
        at org.apache.parquet.format.FileMetaData.write(FileMetaData.java:949)
        at org.apache.parquet.format.Util.write(Util.java:222)
        at org.apache.parquet.format.Util.writeFileMetaData(Util.java:69)
        at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:757)
        at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:750)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:135)
        at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.abort(FileFormatDataWriter.scala:84)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$3(FileFormatWriter.scala:297)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1626)
        ... 11 more
Caused by: java.lang.ArithmeticException: Unscaled value too large for precision
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:83)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
    at org.apache.spark.sql.avro.AvroDeserializer.createDecimal(AvroDeserializer.scala:308)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16(AvroDeserializer.scala:177)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$16$adapted(AvroDeserializer.scala:174)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:336)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:332)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:354)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:351)
    at org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:75)
    at org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:89)
    at org.apache.spark.sql.avro.AvroDataToCatalyst.nullSafeEval(AvroDataToCatalyst.scala:101)
    ... 16 more


As far as I understand, you just need to fetch a single column for a single row. You can probably achieve this by using row.get() to fetch the column value at a specific index. To get a specific column from a Row object, you can use row.get(i), or use the column name with row.getAs[T]("columnName"). You can check the Row class documentation for the details.
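As a quick, hedged illustration of both accessors (the tiny DataFrame and its column names below are invented for this example and are not the question's schema):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val row: Row = Seq(("orders", 42L)).toDF("topic", "offset").head()

val byIndex = row.get(1)                 // Any  = 42 (positional access)
val byName  = row.getAs[Long]("offset")  // Long = 42 (rows from a DataFrame carry a schema)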

Your code would then look like this:

val rddWithExceptionHandling = filterValueDF.rdd.map { row: Row =>
    val binaryFixedValue = row.getSeq[Byte](6) // or row.getAs[Seq[Byte]]("fixedValue")
    val parsed = Try(from_avro(binaryFixedValue, currentValueSchema.value, fromAvroOptions)) match {
        case Success(parsedValue) => List(parsedValue, null)
        case Failure(ex)          => List(null, ex.toString)
    }
    Row.fromSeq(row.toSeq.toList ++ parsed)
}
In your case, though, you don't actually need to go into a map function at all, because inside map you have to work with basic Scala types, while from_avro works with the DataFrame API. This is why you can't call from_avro directly from map: instances of the Column class can only be used in combination with the DataFrame API, i.e. df.select($"c1"), where c1 is an instance of Column. To use from_avro as you originally intended, simply type:

filterValueDF.select(from_avro($"fixedValue", currentValueSchema))
As @mike already mentioned, if from_avro fails to parse the Avro content it will return null. Finally, if you want to separate the successful rows from the failed ones, you can do something like this:

val includingFailuresDf = filterValueDF.select(
                            from_avro($"fixedValue", currentValueSchema) as "avro_res")
                          .withColumn("failed", $"avro_res".isNull)

val successDf = includingFailuresDf.where($"failed" === false)
val failedDf  = includingFailuresDf.where($"failed" === true)

Please note that this code has not been tested.

mike: I'm not sure I fully understand your use case, but I would try to stay within the DataFrame (not converting it to an RDD) and just apply the from_avro method to the fixedValue column with the given schema. If the parsing doesn't work, the from_avro function should return null. That means you could filter your DataFrame on that null value and write those rows to Delta table B, while sending the other part of the filtered result to Delta table A.

OP: @mike Your suggestion is what I am currently doing. However, when from_avro hits a row it cannot parse, it does not return null; it fails the entire streaming job. See the updated post. Also, the behavior you are referring to applies when the mode is PERMISSIVE, not by default.

OP (on the answer): I know it's untested, but it is very close!! However, .isNull is not quite right: you get back a struct with all of its properties, but they are all null, so $"failed" is always false. It looks like I almost need to write a Java/Scala function that extends from_avro to make it more actionable.

Answerer: @Oliver you are right; my only excuse for missing this is that it has been a few months since I used this particular functionality and I had forgotten some of the details. I will update the answer accordingly. You are also right about filtering the failures. It seems you need some lower-level validation of the Avro content. Maybe you can reuse some of the existing tooling in the Spark Avro library, where from_avro and to_avro live.
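Following up on that last point, here is a hedged sketch of flagging failures under PERMISSIVE mode by checking that every field of the returned struct is null, instead of using .isNull on the struct itself. The field names below are hypothetical placeholders, not the real Avro schema, and the variables flaggedDf / allFieldsNull are introduced only for this example:

import org.apache.spark.sql.functions.col

// Hypothetical field names; substitute the actual fields of the Avro schema.
val structFields = Seq("field1", "field2")

// A row failed if every field of the parsed struct is null.
val allFieldsNull = structFields
  .map(f => col(s"avro_res.$f").isNull)
  .reduce(_ && _)

// Replaces the earlier isNull-based "failed" flag, which is always false
// because PERMISSIVE returns a non-null struct whose fields are null.
val flaggedDf = includingFailuresDf.withColumn("failed", allFieldsNull)
val successDf = flaggedDf.where($"failed" === false)
val failedDf  = flaggedDf.where($"failed" === true)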