
Scala: Error performing an inner join on Spark 2.0.1 dataframes

Tags: scala, apache-spark, spark-dataframe

Has anyone else run into this problem, and does anyone have an idea of how to resolve it?

I have been trying to update my code to use Spark 2.0.1 and Scala 2.11. With Spark 1.6.0 and Scala 2.10 everything worked fine. I have a straightforward dataframe-to-dataframe inner join that now returns an error. The data comes from AWS RDS Aurora. Note that the foo dataframe below actually has 92 columns, not the two I am showing; the problem persists even with only two columns.
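
For reference, a minimal sketch of how the two frames are presumably constructed. The JDBC options, the actual SELECT, and the way bar is built are assumptions based only on the physical plan further down, which shows foo scanning a JDBCRelation and bar being a cached in-memory dataframe:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("join-repro").getOrCreate()

// foo: read from Aurora over JDBC (URL, credentials and query are placeholders)
val foo = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://my-aurora-endpoint:3306/mydb")
  .option("dbtable", "(SELECT ... ) as x")   // the real query is not shown in the question
  .option("user", "user")
  .option("password", "password")
  .load()

// bar: built from an existing RDD and cached, matching the
// InMemoryRelation / Scan ExistingRDD nodes in the plan
val bar = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    ("bbBW0", "10.99", "USD"),
    ("CyX50", "438.53", "USD")
  ))
).toDF("TranId", "Amount_USD", "Currency_Alpha").cache()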

Relevant information:

Dataframe 1 with schema

foo.show()

+--------------------+------+
|      Transaction ID|   BIN|
+--------------------+------+
|               bbBW0|134769|
|               CyX50|173622|
+--------------------+------+

println(foo.printSchema())

root
|-- Transaction ID: string (nullable = true)
|-- BIN: string (nullable = true)
Dataframe 2 with schema

bar.show()

+--------------------+-----------------+-------------------+
|              TranId|       Amount_USD|     Currency_Alpha|
+--------------------+-----------------+-------------------+
|               bbBW0|            10.99|                USD|
|               CyX50|           438.53|                USD|
+--------------------+-----------------+-------------------+

println(bar.printSchema())

root
|-- TranId: string (nullable = true)
|-- Amount_USD: string (nullable = true)
|-- Currency_Alpha: string (nullable = true)
The join of the dataframes, with explain

val asdf = foo.join(bar, foo("Transaction ID") === bar("TranId"))
println(foo.join(bar, foo("Transaction ID") === bar("TranId")).explain())

== Physical Plan ==
*BroadcastHashJoin [Transaction ID#0], [TranId#202], Inner, BuildRight
:- *Scan JDBCRelation((SELECT

        ...
        I REMOVED A BUNCH OF LINES FROM THIS PRINT OUT
        ...

      ) as x) [Transaction ID#0,BIN#8] PushedFilters: [IsNotNull(Transaction ID)], ReadSchema: struct<Transaction ID:string,BIN:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
   +- *Filter isnotnull(TranId#202)
      +- InMemoryTableScan [TranId#202, Amount_USD#203, Currency_Alpha#204], [isnotnull(TranId#202)]
         :  +- InMemoryRelation [TranId#202, Amount_USD#203, Currency_Alpha#204], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         :     :  +- Scan ExistingRDD[TranId#202,Amount_USD#203,Currency_Alpha#204]
The full stack trace can be seen here ().


Nowhere in my code, nor in the JDBC query that pulls the data from the database, do I have a filter such as (Transaction ID IS NOT NULL). I have spent a lot of time googling and found a commit for Spark that adds null filters for the join keys to the query plan. Here is the commit ().
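
As a sanity check (a sketch using standard Dataset methods available in Spark 2.x), printing the other plans shows that the isnotnull predicates are injected by Catalyst for the inner-join keys rather than coming from the user code or the JDBC query:

val joined = foo.join(bar, foo("Transaction ID") === bar("TranId"))

joined.explain(true)                          // parsed, analyzed, optimized and physical plans
println(joined.queryExecution.optimizedPlan)  // just the optimized logical plan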

Curious if you have tried the following:

val dfRenamed = bar.withColumnRenamed("TranId", "Transaction ID")
val newDF = foo.join(dfRenamed, Seq("Transaction ID"), "inner")
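
Assuming the rename resolves, a join on Seq("Transaction ID") keeps a single copy of the key column, so there is no ambiguous column to deal with afterwards; a quick sanity check:

// expected columns: Transaction ID, BIN, Amount_USD, Currency_Alpha (all strings)
newDF.printSchema()
newDF.show()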
