Merging multiple dataframes into one in PySpark [non-df]

I am generating dataframes one after another through a process, and I need to merge them into a single dataframe:
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Earl | 32|
+--------+----------+
+--------+----------+
| Name|Age |
+--------+----------+
|Jane | 15|
+--------+----------+
Finally:
+--------+----------+
| Name|Age |
+--------+----------+
|Alex | 30|
+--------+----------+
|Earl | 32|
+--------+----------+
|Jane | 15|
+--------+----------+
I tried many options such as concat, merge, and append, but I believe those are all pandas functions, and I am not using pandas. I am on Python 2.7 and Spark 2.2.
Edit to cover the final scenario with foreachPartition:
l = [('Alex', 30)]
k = [('Earl', 32)]
ldf = spark.createDataFrame(l, ('Name', 'Age'))
ldf = spark.createDataFrame(k, ('Name', 'Age'))
# option 1:
union_df(ldf).show()
# option 2:
uxdf = union_df(ldf)
uxdf.show()
Output in both cases:
+-------+---+
| Name|Age|
+-------+---+
|Earl | 32|
+-------+---+
You can use unionAll() on the dataframes:
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
def unionAll(*dfs):
    return reduce(DataFrame.union, dfs)
df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))
unionAll(df1, df2, df3).show()
## +---+----+
## | k| v|
## +---+----+
## | 1|foo1|
## | 2|bar1|
## | 3|foo2|
## | 4|bar2|
## | 5|foo3|
## | 6|bar3|
## +---+----+
Edit:
You can create an empty dataframe and keep unioning new frames into it:
# Create first dataframe
ldf = spark.createDataFrame(l, ["Name", "Age"])
ldf.show()
# Save its schema
schema = ldf.schema
# Create an empty DF with the same schema, (you need to provide schema to create empty dataframe)
empty_df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
empty_df.show()
# Union the first DF with the empty df
empty_df = empty_df.union(ldf)
empty_df.show()
# New dataframe after some operations
ldf = spark.createDataFrame(k, schema)
# Union with the empty_df again
empty_df = empty_df.union(ldf)
empty_df.show()
# First DF ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# Empty dataframe empty_df
+----+---+
|Name|Age|
+----+---+
+----+---+
# After first union empty_df.union(ldf)
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
+----+---+
# After second union with new ldf
+----+---+
|Name|Age|
+----+---+
|Alex| 30|
|Earl| 32|
+----+---+
In fact, I implemented the same code while you were typing this answer (probably :) ). One difference: unionAll is deprecated (since Spark 2.0) and gives a warning here; union works as well. – Tried creating a new df as suggested. It only keeps the values of the final df and discards the other partitions… The partitions are also dataframes; I tried to use the function above to generalize to partitions instead of dataframes, but was unable to. @pissall, could you give me a bit of advice?