Python: Recursively add missing columns to an arbitrary number of DataFrames
I want to recursively add columns to a variable number of PySpark DataFrames until they all share the same columns. (The added columns are filled with nulls.) I have a function that works for two DataFrames; my question is: how can this be generalized to any number of DataFrames (2, 3, etc.)? I tried using functools.reduce and defining the function signature as *dfs, but I'm not sure how to proceed from here:
def add_missing_col_r(*dfs):
"""Compare column names in dfs and insert missing columns with null values recursively."""
return reduce(DataFrame.withColumn(lambda i : i for i in DataFrame.schema.names), dfs)
Is using a lambda function here a good idea, or is there a better way?
The test DataFrames I'm using:
# Test dataframes
df1 = spark.createDataFrame([(1, "foo1", "qux1"),
                             (2, "bar1", "quux1"),
                             (3, "baz1", "quuz1")],
                            ("a", "b", "c"))
df2 = spark.createDataFrame([(4, "foo2"), (5, "baz2")], ("a", "c"))
df3 = spark.createDataFrame([("bar3", "bar3", "bar3", "bar3"),
                             ("qux3", "quux3", "quuz3", "corge3"),
                             ("grault3", "garply3", "waldo3", "fred3")
                             ],
                            ("b", "d", "e", "f")
                            )
I'm not sure reduce is a good fit here. Plain Python is enough. If you want the resulting columns in a consistent order, see my earlier answer to your other question.
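from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

dfs = [df1, df2, df3]
# Union of the column names across all DataFrames
all_cols = set(sum([i.columns for i in dfs], []))

def add_missing_col_r(dfs):
    return_dfs = []
    for df in dfs:
        missing_cols = all_cols - set(df.columns)
        # Add each missing column as a null string column, in sorted order
        for i in sorted(missing_cols):
            df = df.withColumn(i, lit(None).cast(StringType()))
        return_dfs.append(df)
    return return_dfs

new_dfs = add_missing_col_r(dfs)
# Display every padded DataFrame
for x in new_dfs:
    x.show()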
+---+----+-----+----+----+----+
|  a|   b|    c|   d|   e|   f|
+---+----+-----+----+----+----+
|  1|foo1| qux1|null|null|null|
|  2|bar1|quux1|null|null|null|
|  3|baz1|quuz1|null|null|null|
+---+----+-----+----+----+----+

+---+----+----+----+----+----+
|  a|   c|   b|   d|   e|   f|
+---+----+----+----+----+----+
|  4|foo2|null|null|null|null|
|  5|baz2|null|null|null|null|
+---+----+----+----+----+----+

+-------+-------+------+------+----+----+
|      b|      d|     e|     f|   a|   c|
+-------+-------+------+------+----+----+
|   bar3|   bar3|  bar3|  bar3|null|null|
|   qux3|  quux3| quuz3|corge3|null|null|
|grault3|garply3|waldo3| fred3|null|null|
+-------+-------+------+------+----+----+
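Note that the three results above list the same six columns in different orders. As a rough sketch of where functools.reduce does fit, it can build one shared, ordered list of column names that every DataFrame is then aligned to. The helpers `ordered_union` and `missing_columns` are hypothetical names introduced for this illustration, and the example works on plain lists of names (standing in for `df1.columns`, `df2.columns`, `df3.columns`) so it runs without a Spark session.

```python
from functools import reduce


def ordered_union(*column_lists):
    """Merge column-name lists, keeping first-seen order and dropping repeats."""
    def merge(seen, cols):
        result = list(seen)
        for c in cols:
            # Append only names that have not been collected yet.
            if c not in result:
                result.append(c)
        return result
    return reduce(merge, column_lists, [])


def missing_columns(columns, all_cols):
    """Names in all_cols that are absent from columns, in all_cols order."""
    present = set(columns)
    return [c for c in all_cols if c not in present]


# Stand-ins for df1.columns, df2.columns and df3.columns from the question.
cols1 = ["a", "b", "c"]
cols2 = ["a", "c"]
cols3 = ["b", "d", "e", "f"]

all_cols = ordered_union(cols1, cols2, cols3)
print(all_cols)                          # ['a', 'b', 'c', 'd', 'e', 'f']
print(missing_columns(cols2, all_cols))  # ['b', 'd', 'e', 'f']
```

With real DataFrames, each one could be padded with `withColumn(name, lit(None).cast(StringType()))` for its missing names and then reordered with `df.select(all_cols)`, so that all results share the same column order.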