
Apache Spark: how to perform member-wise operations on an array in Spark SQL?


In Spark SQL, I have a dataframe with a column col that contains an array of Int of size 100 (for example).

I want to aggregate this column into a single value that is an array of Int of size 100, containing the sum of each element across the column. This can be done by calling:

dataframe.agg(functions.array((0 until 100).map(i => functions.sum(i)) : _*))
This generates code that explicitly performs 100 aggregations and then presents the 100 results as an array of 100 items. However, this seems very inefficient, since Catalyst even fails to generate code for it once my array size exceeds ~1000 items. Is there a construct in Spark SQL that does this more efficiently? Ideally the sum aggregation would propagate automatically over an array to perform a member-wise sum, but I haven't found anything related to this in the documentation. What are the alternatives to my code?
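For context, a fuller sketch of this per-index aggregation might look like the following. This is an illustrative reconstruction only: the column name "col" and the use of getItem are assumptions, not part of the original snippet.

import org.apache.spark.sql.functions

// One sum() per array index, re-assembled into a single array column.
// Assumes the array column is named "col" and has exactly 100 elements.
val perIndexSums = (0 until 100).map(i => functions.sum(functions.col("col").getItem(i)))
val aggregated = dataframe.agg(functions.array(perIndexSums: _*))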

Edit: my traceback:

   ERROR codegen.CodeGenerator: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
    at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
    at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
    at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
    at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
    at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
    at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
    at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1002)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1069)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1066)
    at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:948)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:375)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1.apply(HashAggregateExec.scala:97)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1.apply(HashAggregateExec.scala:92)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:92)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1.apply(HashAggregateExec.scala:97)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$doExecute$1.apply(HashAggregateExec.scala:92)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doExecute(HashAggregateExec.scala:92)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:173)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:597)
    at com.criteo.enterprise.eligibility_metrics.RankingMetricsComputer$.runAndSaveMetrics(RankingMetricsComputer.scala:286)
    at com.criteo.enterprise.eligibility_metrics.RankingMetricsComputer$.main(RankingMetricsComputer.scala:366)
    at com.criteo.enterprise.eligibility_metrics.RankingMetricsComputer.main(RankingMetricsComputer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)

The best way to do this is to explode the nested arrays into their own rows, so that you can use a single groupBy. That way you do everything in one aggregation instead of 100 (or more). The key to doing this is posexplode, which turns each entry of an array into a new row together with the index of where it was positioned in the array.

For example:

import org.apache.spark.sql.functions.{posexplode, collect_list, struct, sum}
import spark.implicits._  // for toDF and the $"..." syntax; assumes a SparkSession named spark, as in spark-shell

val data = Seq(
    (Seq(1, 2, 3, 4, 5)),
    (Seq(2, 3, 4, 5, 6)),
    (Seq(3, 4, 5, 6, 7))
)

val df = data.toDF

val df2 = df.
    select(posexplode($"value")).
    groupBy($"pos").
    agg(sum($"col") as "sum")

// At this point you will have rows with the index and the sum
df2.orderBy($"pos".asc).show
This will output a dataframe that looks like this:

+---+---+
|pos|sum|
+---+---+
|  0|  6|
|  1|  9|
|  2| 12|
|  3| 15|
|  4| 18|
+---+---+
Or, if you want them all in one row, you can do something like this:

df2.groupBy().agg(collect_list(struct($"pos", $"sum")) as "list").show
The values in the resulting array column won't be sorted, but you could write a UDF that sorts them by the pos field and drops pos if you want to (a minimal sketch of such a UDF follows).
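A sketch of that sorting UDF, continuing from the example above, might look like this (illustrative only; note that sum() over an Int column produces Long values inside the struct):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Sort the collected (pos, sum) structs by pos, then keep only the sums.
// sum() over an IntegerType column yields LongType, hence getLong(1).
val sortByPos = udf { entries: Seq[Row] =>
  entries.sortBy(_.getInt(0)).map(_.getLong(1))
}

df2.groupBy()
  .agg(collect_list(struct($"pos", $"sum")) as "list")
  .select(sortByPos($"list") as "sums")
  .show(false)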

Update based on the comments

If the above doesn't work alongside whatever other aggregations you are trying to do, then you will need to define your own UDAF. The general idea is that you tell Spark how to combine values for the same key within a partition to create intermediate values, and then how to combine those intermediate values across partitions to produce the final value for each key. Once you have defined a UDAF class, you can use it in your aggs call along with any other aggregations you want to perform.

Here is a quick example I knocked together. Note that it assumes a fixed array length and should probably be made more error-proof, but it should get you most of the way there.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction

class ArrayCombine extends UserDefinedAggregateFunction {
  // The input this aggregation will receive (each row)
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(StructField("value", ArrayType(IntegerType)) :: Nil)

  // Your intermediate state as it is updated with data from each row
  override def bufferSchema: StructType = StructType(
    StructField("value", ArrayType(IntegerType)) :: Nil
  )

  // This is the output type of the aggregation function
  override def dataType: DataType = ArrayType(IntegerType)

  override def deterministic: Boolean = true

  // Initial value of the buffer schema: an array of zeros
  // (fixed length of 100 assumed by this example)
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Array.fill(100)(0)
  }

  // Given a new input row, update our state
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val sums = buffer.getSeq[Int](0)
    val newVals = input.getSeq[Int](0)
    buffer(0) = sums.zip(newVals).map { case (a, b) => a + b }
  }

  // After the intermediate values for each partition have been computed, combine the partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val sums1 = buffer1.getSeq[Int](0)
    val sums2 = buffer2.getSeq[Int](0)
    buffer1(0) = sums1.zip(sums2).map { case (a, b) => a + b }
  }

  // This is where you output the final value, given the final value of your bufferSchema
  override def evaluate(buffer: Row): Any = {
    buffer.getSeq[Int](0)
  }
}
Then just call it like this:

val arrayUdaf = new ArrayCombine()
df.groupBy().agg(arrayUdaf($"value")).show

Comments:

Sorry, I don't have the Catalyst exception right now, but it is related to the generated code being too big / using too many variables.

If you get a chance, please edit the question and attach the traceback. It will help diagnose the problem and find a possible solution. It would also be great if you could include type annotations (what is dataframe: a Dataset[_] or a RelationalGroupedDataset?). Performance-wise, you won't find a better solution than the aggregation.

Great. Regarding the dataframe's type.

A UserDefinedAggregateFunction like the one suggested here is pretty bad if @lezebulon wants better performance, especially with large arrays; the linked approach is faster. But if the OP needs to run multiple aggregations at once, I would try the UDAF first and see how it performs on his data before trying a more specialized, less flexible approach.

Thanks a lot; I can't test any more today, but I will report my results tomorrow.

@user6910411 Could I reuse the answer from the question you linked, but use primitive types instead of ArrayType in the UDAF? If so, shouldn't it be faster while still keeping a UDAF?

@lezebulon You can try, though if you are hitting Catalyst issues it probably won't change anything, and even if it does, it is little more than a stopgap. Out of curiosity, does the code work fine when you skip the array?
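For reference, a minimal sketch of the primitive-typed UDAF variant raised in the comment above might look like the following. It is an illustration only: the class name, the length parameter n, and the per-position field names are assumptions, not code from the linked question.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}

// One IntegerType buffer field per array position instead of a single ArrayType field.
class PrimitiveArraySum(n: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(StructField("value", ArrayType(IntegerType)) :: Nil)

  // n primitive buffer slots, one running sum per array index
  override def bufferSchema: StructType =
    StructType((0 until n).map(i => StructField(s"sum_$i", IntegerType)))

  override def dataType: DataType = ArrayType(IntegerType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    (0 until n).foreach(i => buffer(i) = 0)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val xs = input.getSeq[Int](0)
    (0 until n).foreach(i => buffer(i) = buffer.getInt(i) + xs(i))
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    (0 until n).foreach(i => buffer1(i) = buffer1.getInt(i) + buffer2.getInt(i))

  override def evaluate(buffer: Row): Any =
    (0 until n).map(i => buffer.getInt(i))
}

// Usage, analogous to ArrayCombine above:
// df.groupBy().agg(new PrimitiveArraySum(100)($"value")).show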