
Apache Spark: merging columns in a DataFrame


I have a DataFrame with 4 columns and want to merge the first 2 columns with the last 2 columns into a new DataFrame.

The data are of the same kind, the order does not matter, and any duplicates must be preserved.

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ["This is line 1", "xxxx12", "This is line 5", "hhhh29"],
    ["This is line 2", "yyyy23", "This is line 6", "kkkk47"],
    ["This is line 3", "zzzz64", "This is line 7", "llll88"],
    ["This is line 4", "gggg37", "This is line 8", "ssss84"],
]).toDF("col_a", "col_b", "col_c", "col_d")
Desired new DataFrame:

+---------------+-------+
| col_1         |col_2  |
+---------------+-------+
|This is line 1 |xxxx12 |
|This is line 5 |hhhh29 |
|This is line 2 |yyyy23 |
|This is line 6 |kkkk47 |
|This is line 3 |zzzz64 |
|This is line 7 |llll88 |
|This is line 4 |gggg37 |
|This is line 8 |ssss84 |
+---------------+-------+

How can this be done?

If the order is not important, you can use unionAll (since Spark 2.0, union is the preferred name for the same operation):

df2 = df.selectExpr(
    "col_a as col_1", "col_b as col_2"
).unionAll(
    df.selectExpr("col_c as col_1", "col_d as col_2")
)

df2.show()
+--------------+------+
|         col_1| col_2|
+--------------+------+
|This is line 1|xxxx12|
|This is line 2|yyyy23|
|This is line 3|zzzz64|
|This is line 4|gggg37|
|This is line 5|hhhh29|
|This is line 6|kkkk47|
|This is line 7|llll88|
|This is line 8|ssss84|
+--------------+------+
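To see why this approach groups all of col_a/col_b before col_c/col_d, here is a plain-Python sketch (not Spark code, just an illustration of the row ordering) of what the two selects plus unionAll produce:

```python
def union_pairs(rows):
    """Sketch of select(col_a, col_b).unionAll(select(col_c, col_d)).

    Take (col_a, col_b) from every row, then (col_c, col_d) from every
    row; concatenating the two lists puts all second-half pairs after
    the first half, matching the grouped output above.
    """
    first = [(a, b) for a, b, _c, _d in rows]
    second = [(c, d) for _a, _b, c, d in rows]
    return first + second

rows = [
    ("This is line 1", "xxxx12", "This is line 5", "hhhh29"),
    ("This is line 2", "yyyy23", "This is line 6", "kkkk47"),
]
for pair in union_pairs(rows):
    print(pair)
```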
Or you can use stack, which keeps the pairs from each input row adjacent, matching the interleaved order in the desired output:

df2 = df.selectExpr("stack(2, col_a, col_b, col_c, col_d) as (col_1, col_2)")

df2.show()
+--------------+------+
|         col_1| col_2|
+--------------+------+
|This is line 1|xxxx12|
|This is line 5|hhhh29|
|This is line 2|yyyy23|
|This is line 6|kkkk47|
|This is line 3|zzzz64|
|This is line 7|llll88|
|This is line 4|gggg37|
|This is line 8|ssss84|
+--------------+------+
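To make the interleaving concrete, here is a plain-Python sketch (not Spark code) of what stack(2, col_a, col_b, col_c, col_d) does per row: each 4-value input row is split into 2 output rows of 2 values, emitted back to back.

```python
def stack2(rows):
    """Sketch of stack(2, col_a, col_b, col_c, col_d): each input row
    yields two output pairs in order, so (col_c, col_d) from a row
    immediately follows (col_a, col_b) from the same row."""
    out = []
    for a, b, c, d in rows:
        out.append((a, b))  # first pair from the row
        out.append((c, d))  # second pair, right after it
    return out

rows = [
    ("This is line 1", "xxxx12", "This is line 5", "hhhh29"),
    ("This is line 2", "yyyy23", "This is line 6", "kkkk47"),
]
for pair in stack2(rows):
    print(pair)
```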