
How to join Spark DataFrames in Java without duplicating columns


How can I merge two DataFrames without duplicating the shared columns?

a.show()

+-----+-------------------+--------+------+
| Name|           LastTime|Duration|Status|
+-----+-------------------+--------+------+
|  Bob|2015-04-23 12:33:00|       1|logout|
|Alice|2015-04-20 12:33:00|       5| login|
+-----+-------------------+--------+------+

b.show()
+-----+-------------------+--------+------+
| Name|           LastTime|Duration|Status|
+-----+-------------------+--------+------+
|  Bob|2015-04-24 00:33:00|       1|login |
+-----+-------------------+--------+------+
I want to build a new DataFrame that keeps all the data from DataFrame a, but updates matching rows with the values from DataFrame b:

+-----+-------------------+--------+------+
| Name|           LastTime|Duration|Status|
+-----+-------------------+--------+------+
|  Bob|2015-04-24 00:33:00|       1|login |
|Alice|2015-04-20 12:33:00|       5| login|
+-----+-------------------+--------+------+
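As a sketch of one way to get this "prefer b's values, fall back to a's" result with the DataFrame API (assuming Spark 2.x with a local `SparkSession`; the `UpdateJoinSketch` class name and the renamed `*_b` columns are illustrative, not from the original post): left-outer join on the key only, then `coalesce` each non-key column.

```java
import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.collection.JavaConverters;

public class UpdateJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("update-join").master("local[*]").getOrCreate();

        // Recreate the example data from the question with inline tables.
        Dataset<Row> a = spark.sql(
            "SELECT * FROM VALUES " +
            "('Bob',  '2015-04-23 12:33:00', 1, 'logout'), " +
            "('Alice','2015-04-20 12:33:00', 5, 'login') " +
            "AS a(Name, LastTime, Duration, Status)");
        Dataset<Row> b = spark.sql(
            "SELECT * FROM VALUES " +
            "('Bob', '2015-04-24 00:33:00', 1, 'login') " +
            "AS b(Name, LastTime, Duration, Status)");

        // Rename b's non-key columns so the join produces no ambiguous names.
        Dataset<Row> b2 = b.withColumnRenamed("LastTime", "LastTime_b")
                           .withColumnRenamed("Duration", "Duration_b")
                           .withColumnRenamed("Status", "Status_b");

        // Join on the key column only; "Name" appears once in the result.
        Dataset<Row> joined = a.join(b2,
            JavaConverters.asScalaBufferConverter(Arrays.asList("Name"))
                .asScala().toSeq(),
            "left_outer");

        // Take b's value where a match exists, otherwise keep a's.
        Dataset<Row> result = joined.select(col("Name"),
            coalesce(col("LastTime_b"), col("LastTime")).alias("LastTime"),
            coalesce(col("Duration_b"), col("Duration")).alias("Duration"),
            coalesce(col("Status_b"), col("Status")).alias("Status"));

        result.show();
        spark.stop();
    }
}
```

This mirrors the `CASE WHEN b.Name IS NULL` logic of the SQL answer further down, expressed through the typed API instead.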
I was able to write this join and build the DataFrame in Scala, but I can't get it to work in Java:

DataFrame f = a.join(b,
    a.col("Name").equalTo(b.col("Name"))
        .and(a.col("LastTime").equalTo(b.col("LastTime")))
        .and(a.col("Duration").equalTo(b.col("Duration"))),
    "outer");
When I perform the join like this, I get duplicated columns.

In Scala I solved this by passing a Seq of the column names to the join.

You can do the same in Java by converting the column names into a Scala Seq. Below is your corrected sample code:

DataFrame f = a.join(b,
    // Convert Java List to Scala Seq
    scala.collection.JavaConverters.asScalaIteratorConverter(
        Arrays.asList("Name", "LastTime", "Duration").iterator()
    ).asScala().toSeq(),
    "outer");

Alternatively, you can perform a left semi join ("leftsemi") to avoid the duplicated columns coming from dataset b.
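As a sketch of that alternative (assuming the same `a` and `b` DataFrames as in the question): a semi join returns only the left side's columns, so nothing from `b` can be duplicated.

```java
// "leftsemi" keeps the rows of a that have a match in b,
// and returns only a's columns -- no column from b appears at all.
Dataset<Row> matched = a.join(b, a.col("Name").equalTo(b.col("Name")), "leftsemi");
matched.show();
```

Note that a semi join filters rather than merges: it does not pull any updated values over from `b`, so it only suits cases where you need a's rows, deduplicated against b's keys.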


The correct way is (tested):

Dataset<Row> f = a.join(b,
    // Convert Java List to Scala Seq
    JavaConverters.collectionAsScalaIterableConverter(
        asList("Name", "LastTime", "Duration"))
        .asScala().toSeq(),
    "outer");

I think we can also try this with Spark SQL, which can be executed from Java as well:

spark.sql("""SELECT a.Name as Name,
CASE WHEN b.Name is null THEN a.LastTime ELSE b.LastTime END AS LastTime,
CASE WHEN b.Name is null THEN a.Duration ELSE b.Duration END AS Duration,
CASE WHEN b.Name is null THEN a.Status ELSE b.Status END AS Status 
FROM a a left outer join  b b on a.Name=b.Name 
""").show(false)

+-----+-------------------+--------+------+
|Name |LastTime           |Duration|Status|
+-----+-------------------+--------+------+
|Bob  |2015-04-24 00:33:00|1       |login |
|Alice|2015-04-20 12:33:00|5       |login |
+-----+-------------------+--------+------+
The join condition can be adjusted to fit your use case.
