Join Spark: joining DataFrame columns using an array

I have two DataFrames, each with two columns:

  • df1
    with schema
    (key1: Long, Value)

  • df2
    with schema
    (key2: Array[Long], Value)

I need to join these DataFrames on the key columns (matching the value of key1 against the values inside key2). The problem is that the columns have different types. Is there a way to do this?

You can cast key1 and key2 to strings and then use the contains function, as follows:

import org.apache.spark.sql.functions.col
import spark.implicits._  // needed for toDF outside the spark-shell

val df1 = sc.parallelize(Seq((1L,"one.df1"),
                             (2L,"two.df1"),
                             (3L,"three.df1"))).toDF("key1","Value")

DF1:
+----+---------+
|key1|Value    |
+----+---------+
|1   |one.df1  |
|2   |two.df1  |
|3   |three.df1|
+----+---------+

val df2 = sc.parallelize(Seq((Array(1L,1L),"one.df2"),
                             (Array(2L,2L),"two.df2"),
                             (Array(3L,3L),"three.df2"))).toDF("key2","Value")
DF2:
+------+---------+
|key2  |Value    |
+------+---------+
|[1, 1]|one.df2  |
|[2, 2]|two.df2  |
|[3, 3]|three.df2|
+------+---------+

val joinedRDD = df1.join(df2, col("key2").cast("string").contains(col("key1").cast("string")))

JOIN:
+----+---------+------+---------+
|key1|Value    |key2  |Value    |
+----+---------+------+---------+
|1   |one.df1  |[1, 1]|one.df2  |
|2   |two.df1  |[2, 2]|two.df2  |
|3   |three.df1|[3, 3]|three.df2|
+----+---------+------+---------+
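As the comments below point out, this string-based comparison does plain substring matching on the rendered array, so it can also join rows that should not match. A minimal sketch of the pitfall, using hypothetical extra rows that are not part of the original answer:

// Hypothetical rows illustrating the substring pitfall of the cast-to-string join.
// Array(12L, 30L) rendered as a string ("[12, 30]") contains the substring "2",
// so key1 = 2 matches even though 2 is not an element of the array.
val a = Seq((2L, "two.df1")).toDF("key1", "Value")
val b = Seq((Array(12L, 30L), "other.df2")).toDF("key2", "Value")
a.join(b, col("key2").cast("string").contains(col("key1").cast("string"))).show(false)
// produces a spurious match: key1 = 2 joined with key2 = [12, 30]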
The best approach is to use the
array_contains
Spark SQL expression, as follows:

import org.apache.spark.sql.functions.expr
import spark.implicits._

val df1 = Seq((1L,"one.df1"), (2L,"two.df1"),(3L,"three.df1")).toDF("key1","Value")

val df2 = Seq((Array(1L,1L),"one.df2"), (Array(2L,2L),"two.df2"), (Array(3L,3L),"three.df2")).toDF("key2","Value")

val joinedDF = df1.join(df2, expr("array_contains(key2, key1)"))
joinedDF.show()

+----+---------+------+---------+
|key1|    Value|  key2|    Value|
+----+---------+------+---------+
|   1|  one.df1|[1, 1]|  one.df2|
|   2|  two.df1|[2, 2]|  two.df2|
|   3|three.df1|[3, 3]|three.df2|
+----+---------+------+---------+

Note that you cannot use the
org.apache.spark.sql.functions.array_contains
function directly here, because it expects its second argument to be a literal rather than a column expression.
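Since array_contains is available as a SQL expression, the same join can also be written through the SQL API. A minimal sketch, assuming the df1/df2 frames defined above and an active SparkSession named spark; the view names are illustrative:

// Register temporary views and express the join condition in SQL.
df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
spark.sql("""
  SELECT t1.key1, t1.Value, t2.key2, t2.Value
  FROM t1 JOIN t2 ON array_contains(t2.key2, t1.key1)
""").show(false)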

Does key2 in df2 have to contain key1 from df1? One approach would be to explode the Array[Long] column and then join against df1.

Casting to strings will join rows that should not be joined: the string "123" contains the strings "23", "12", "1", and so on.

Thanks, the code works in PySpark. But what is the purpose of import spark.implicits? I cannot find that module in
pyspark.

import spark.implicits._ is Scala-only; you do not need it in PySpark.
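For completeness, the explode-based alternative suggested in the comments could look roughly like this (a sketch, not part of the original answers; the column name key2_elem is just an illustrative choice):

import org.apache.spark.sql.functions.{col, explode}

// One row per array element, then an ordinary equality join.
val exploded = df2.withColumn("key2_elem", explode(col("key2")))
val joined   = df1.join(exploded, col("key1") === col("key2_elem")).drop("key2_elem")
joined.show(false)
// Note: duplicate elements in key2 (e.g. [1, 1]) yield duplicate joined rows; dedupe if needed.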