无法在Scala Spark中合并两个数据帧
我一直在尝试将1个数据帧附加到Scala中的另一个DF。本例中的追加操作只是向现有列添加一个大小相同的新列-不涉及键匹配。两个数据帧的形状相同(仅5行1列)无法在Scala Spark中合并两个数据帧,scala,apache-spark,merge,Scala,Apache Spark,Merge,我一直在尝试将1个数据帧附加到Scala中的另一个DF。本例中的追加操作只是向现有列添加一个大小相同的新列-不涉及键匹配。两个数据帧的形状相同(仅5行1列) join()函数运行,我甚至可以获得模式,但是当我想显示新DF的所有值时,我得到了错误: scala> val outputModelDF1 = coefficients.join(tvalues) outputModelDF1: org.apache.spark.sql.DataFrame = [coefficients: doub
join()
函数运行,我甚至可以获得模式,但是当我想显示新DF的所有值时,我得到了错误:
scala> val outputModelDF1 = coefficients.join(tvalues)
outputModelDF1: org.apache.spark.sql.DataFrame = [coefficients: double, t-values: double]
scala> outputModelDF1.printSchema()
root
|-- coefficients: double (nullable = false)
|-- t-values: double (nullable = false)
scala> outputModelDF1.show()
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [value#359 AS coefficients#361]
+- LocalRelation [value#359]
and
Project [value#368 AS t-values#370]
+- LocalRelation [value#368]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1077)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1062)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2832)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
at org.apache.spark.sql.Dataset.show(Dataset.scala:644)
at org.apache.spark.sql.Dataset.show(Dataset.scala:603)
at org.apache.spark.sql.Dataset.show(Dataset.scala:612)
... 52 elided
你知道如何处理它,以及如何简单地将这两个DFs合并在一起吗
更新1
我应该说明我想要实现的输出的期望格式。请参阅下文:
+--------------------+--------------------+
| coefficients| t-values|
+--------------------+--------------------+
| -59525.0697785032| 1.8267249911295418|
| 6957.836000531959| 100.35507390273406|
| 314.2998010755629| -8.768588605222108|
|-0.37884289844065666| -0.4656738230173362|
| -1758.154438149325| -1758.154438149325|
+--------------------+--------------------+
更新2
不幸的是,以下使用withColumn()
的方法不起作用
scala> val outputModelDF1 = coefficients.withColumn("t-values", tvalues("t-values"))
org.apache.spark.sql.AnalysisException: resolved attribute(s) t-values#119 missing from coefficients#113 in operator !Project [coefficients#113, t-values#119 AS t-values#130];;
!Project [coefficients#113, t-values#119 AS t-values#130]
+- Project [value#111 AS coefficients#113]
+- LocalRelation [value#111]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2872)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1153)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1908)
... 52 elided
一种方法是使用
monoticallyincreasingid
在数据帧中为join
创建关键列:
val df1 = Seq(
(-59525.0697785032), (6957.836000531959), (314.2998010755629), (-0.37884289844065666), (-1758.154438149325)
).toDF("coefficients")
val df2 = Seq(
(1.8267249911295418), (100.35507390273406), (-8.768588605222108), (-0.4656738230173362), (10.48091833711012)
).toDF("t-values")
val df1R = df1.withColumn("rowid", monotonicallyIncreasingId)
val df2R = df2.withColumn("rowid", monotonicallyIncreasingId)
val dfJoined = df1R.join(df2R, Seq("rowid"))
dfJoined.show
+-----+--------------------+-------------------+
|rowid| coefficients| t-values|
+-----+--------------------+-------------------+
| 0| -59525.0697785032| 1.8267249911295418|
| 1| 6957.836000531959| 100.35507390273406|
| 2| 314.2998010755629| -8.768588605222108|
| 3|-0.37884289844065666|-0.4656738230173362|
| 4| -1758.154438149325| 10.48091833711012|
+-----+--------------------+-------------------+
一种方法是使用
monoticallyincreasingid
在数据帧中为join
创建关键列:
val df1 = Seq(
(-59525.0697785032), (6957.836000531959), (314.2998010755629), (-0.37884289844065666), (-1758.154438149325)
).toDF("coefficients")
val df2 = Seq(
(1.8267249911295418), (100.35507390273406), (-8.768588605222108), (-0.4656738230173362), (10.48091833711012)
).toDF("t-values")
val df1R = df1.withColumn("rowid", monotonicallyIncreasingId)
val df2R = df2.withColumn("rowid", monotonicallyIncreasingId)
val dfJoined = df1R.join(df2R, Seq("rowid"))
dfJoined.show
+-----+--------------------+-------------------+
|rowid| coefficients| t-values|
+-----+--------------------+-------------------+
| 0| -59525.0697785032| 1.8267249911295418|
| 1| 6957.836000531959| 100.35507390273406|
| 2| 314.2998010755629| -8.768588605222108|
| 3|-0.37884289844065666|-0.4656738230173362|
| 4| -1758.154438149325| 10.48091833711012|
+-----+--------------------+-------------------+
您正在执行SQL交叉联接,而不是附加两列together@cricket_007是的,我知道,从错误消息中很清楚,但我不想要交叉连接。请查看上面所需输出的更新。查看
withColumn
function@cricket_007谢谢,你有个好主意。下面的Leo C展示了一个工作示例。您正在执行SQL交叉连接,而不是附加两列together@cricket_007是的,我知道,从错误消息中很清楚,但我不想要交叉连接。请查看上面所需输出的更新。查看withColumn
function@cricket_007谢谢,你有个好主意。下面的Leo C展示了这个工作示例。谢谢,这很有效。我接受了你的解决方案。然而,我仍然希望有一个更好的方法来做这件事。我发现创建两个额外的DFs非常低效。知道Scala是否有类似于R的cbind()函数的功能吗?@simtim,不幸的是,我不知道Spark中使用Scala的R的cbind
有任何等价性。您可以考虑将数据帧转换为RDDS,执行<代码> zip < /代码>,如<代码> DF1.RDD zip DF2。RDD < /C> >,然后将结果返回到数据框。在这种情况下,我更喜欢使用CopnNo(),而不是将数据框转换为RDDs和后端。当然,在我的示例中,这些DFs非常小——我基本上是在清理一些ml算法的输出,但我想知道在大型DFs上使用withColumn()或zip方法有多有效。当然,这一定很慢而且内存不足。你说得对,withColumn
并不特别便宜。在您的情况下,如果适用的话,ML计算最好在结果数据帧中保留行标识列。谢谢,这很有效。我接受了你的解决方案。然而,我仍然希望有一个更好的方法来做这件事。我发现创建两个额外的DFs非常低效。知道Scala是否有类似于R的cbind()函数的功能吗?@simtim,不幸的是,我不知道Spark中使用Scala的R的cbind
有任何等价性。您可以考虑将数据帧转换为RDDS,执行<代码> zip < /代码>,如<代码> DF1.RDD zip DF2。RDD < /C> >,然后将结果返回到数据框。在这种情况下,我更喜欢使用CopnNo(),而不是将数据框转换为RDDs和后端。当然,在我的示例中,这些DFs非常小——我基本上是在清理一些ml算法的输出,但我想知道在大型DFs上使用withColumn()或zip方法有多有效。当然,这一定很慢而且内存不足。你说得对,withColumn
并不特别便宜。在您的情况下,如果适用的话,ML计算最好在结果数据帧中保留行标识列。