Apache Spark: how to cross join 2 DataFrames?

Tags: apache-spark, apache-spark-sql, spark-dataframe

I am trying to get the cross join of two DataFrames. I am using Spark 2.0. How can I implement a cross join with two DataFrames?

Edit:


Call join with the other DataFrame, without a join condition.

See the example below. Given a first DataFrame of people:

+---+------+-------+------+
| id|  name|   mail|idArea|
+---+------+-------+------+
|  1|  Jack|j@j.com|     1|
|  2|Valery|x@v.com|     1|
|  3|  Karl|k@k.com|     2|
|  4|  Nick|n@n.com|     2|
|  5|  Luke|l@f.com|     3|
|  6| Marek|a@b.com|     3|
+---+------+-------+------+
and a second DataFrame of areas:

+------+--------------+
|idArea|      areaName|
+------+--------------+
|     1|Amministration|
|     2|        Public|
|     3|         Store|
+------+--------------+
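
For reference, the two DataFrames above can be built like this (a minimal sketch; the names people and area match the snippets below, and spark is assumed to be an existing SparkSession):

import spark.implicits._

val people = Seq(
    (1, "Jack",   "j@j.com", 1),
    (2, "Valery", "x@v.com", 1),
    (3, "Karl",   "k@k.com", 2),
    (4, "Nick",   "n@n.com", 2),
    (5, "Luke",   "l@f.com", 3),
    (6, "Marek",  "a@b.com", 3)
).toDF("id", "name", "mail", "idArea")

val area = Seq(
    (1, "Amministration"),
    (2, "Public"),
    (3, "Store")
).toDF("idArea", "areaName")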
the cross join is expressed simply as:

val cross = people.join(area)
cross.show()
+---+------+-------+------+------+--------------+
| id|  name|   mail|idArea|idArea|      areaName|
+---+------+-------+------+------+--------------+
|  1|  Jack|j@j.com|     1|     1|Amministration|
|  1|  Jack|j@j.com|     1|     3|         Store|
|  1|  Jack|j@j.com|     1|     2|        Public|
|  2|Valery|x@v.com|     1|     1|Amministration|
|  2|Valery|x@v.com|     1|     3|         Store|
|  2|Valery|x@v.com|     1|     2|        Public|
|  3|  Karl|k@k.com|     2|     1|Amministration|
|  3|  Karl|k@k.com|     2|     2|        Public|
|  3|  Karl|k@k.com|     2|     3|         Store|
|  4|  Nick|n@n.com|     2|     3|         Store|
|  4|  Nick|n@n.com|     2|     2|        Public|
|  4|  Nick|n@n.com|     2|     1|Amministration|
|  5|  Luke|l@f.com|     3|     2|        Public|
|  5|  Luke|l@f.com|     3|     3|         Store|
|  5|  Luke|l@f.com|     3|     1|Amministration|
|  6| Marek|a@b.com|     3|     1|Amministration|
|  6| Marek|a@b.com|     3|     2|        Public|
|  6| Marek|a@b.com|     3|     3|         Store|
+---+------+-------+------+------+--------------+

Upgrade to the latest version of spark-sql_2.11 (2.1.0) and use the crossJoin function: Datasets have a dedicated cross join for the case where you don't need to specify any join condition.

Here is an excerpt of working code:

people.crossJoin(area).show()

You may have to enable cross joins in your Spark configuration first. For example:
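A minimal sketch using the spark.sql.crossJoin.enabled flag, set either on the session at runtime or at submit time:

// allow cartesian products for this session
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// or when submitting the job:
// spark-submit --conf spark.sql.crossJoin.enabled=true ...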

Then use something like this:

df1.join(df2, <condition>)
df1.join(df2)
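
The same thing can be written in Spark SQL (a sketch; assumes the DataFrames are registered as temp views). Explicit CROSS JOIN syntax is accepted even when implicit cartesian products are disabled:

df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
val cross = spark.sql("SELECT * FROM t1 CROSS JOIN t2")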

If the area data is small, you can compute the cross join via explode, without a shuffle:

import org.apache.spark.sql.functions._  // array, explode, lit, struct
import spark.implicits._                 // toDF and $-notation

val df1 = Seq(
    (1, "Jack",   "j@j.com", 1),
    (2, "Valery", "x@v.com", 1),
    (3, "Karl",   "k@k.com", 2),
    (4, "Nick",   "n@n.com", 2),
    (5, "Luke",   "l@f.com", 3),
    (6, "Marek",  "a@b.com", 3)
).toDF("id", "name", "mail", "idArea")

// pack the small area table into a single literal array of structs
val arr = array(
    Seq(
        (1, "Amministration"),
        (2, "Public"),
        (3, "Store")
    ).map(r => struct(lit(r._1).as("idArea"), lit(r._2).as("areaName"))): _*
)

// explode duplicates every df1 row once per array element, which is
// exactly a cross join, computed per partition with no shuffle
val cross = df1
    .withColumn("d", explode(arr))
    .withColumn("idArea", $"d.idArea")      // overwrite idArea with the exploded value
    .withColumn("areaName", $"d.areaName")
    .drop("d")

df1.show
cross.show
Output:

+---+------+-------+------+
| id|  name|   mail|idArea|
+---+------+-------+------+
|  1|  Jack|j@j.com|     1|
|  2|Valery|x@v.com|     1|
|  3|  Karl|k@k.com|     2|
|  4|  Nick|n@n.com|     2|
|  5|  Luke|l@f.com|     3|
|  6| Marek|a@b.com|     3|
+---+------+-------+------+

+---+------+-------+------+--------------+
| id|  name|   mail|idArea|      areaName|
+---+------+-------+------+--------------+
|  1|  Jack|j@j.com|     1|Amministration|
|  1|  Jack|j@j.com|     2|        Public|
|  1|  Jack|j@j.com|     3|         Store|
|  2|Valery|x@v.com|     1|Amministration|
|  2|Valery|x@v.com|     2|        Public|
|  2|Valery|x@v.com|     3|         Store|
|  3|  Karl|k@k.com|     1|Amministration|
|  3|  Karl|k@k.com|     2|        Public|
|  3|  Karl|k@k.com|     3|         Store|
|  4|  Nick|n@n.com|     1|Amministration|
|  4|  Nick|n@n.com|     2|        Public|
|  4|  Nick|n@n.com|     3|         Store|
|  5|  Luke|l@f.com|     1|Amministration|
|  5|  Luke|l@f.com|     2|        Public|
|  5|  Luke|l@f.com|     3|         Store|
|  6| Marek|a@b.com|     1|Amministration|
|  6| Marek|a@b.com|     2|        Public|
|  6| Marek|a@b.com|     3|         Store|
+---+------+-------+------+--------------+
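
If you would rather not hand-build the array, an alternative that also stays shuffle-free is to broadcast the small side, so Spark plans the cross join as a broadcast nested loop join (a sketch; assumes the area DataFrame built earlier and Spark 2.1+ for crossJoin):

import org.apache.spark.sql.functions.broadcast

// broadcasting the small table avoids shuffling df1
val cross2 = df1.crossJoin(broadcast(area))
cross2.show()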

Comments:

Show us what you have tried… val df = df.join(df_t1, df("Col1") === df_t1("col")).join(df2, joinType = "cross join").where(df("col2") === df2("col2"))

DataFrames now have a cross-join method called crossJoin.