Apache Spark: create multiple data frames from an existing data frame in PySpark
Tags: apache-spark, pyspark, apache-spark-sql

I have a data frame in PySpark as shown below:
data = [{"B_ID": 'TEST', "Category": 'Category A', "ID": 1, "Value": 1},
{"B_ID": 'TEST', "Category": 'Category B', "ID": 2, "Value": 2},
{"B_ID": 'TEST', "Category": 'Category C', "ID": 3, "Value": None},
{"B_ID": 'TEST', "Category": 'Category D', "ID": 4, "Value": 3},
]
df = spark.createDataFrame(data)
df.show()
+----+----------+---+-----+
|B_ID| Category| ID|Value|
+----+----------+---+-----+
|TEST|Category A| 1| 1|
|TEST|Category B| 2| 2|
|TEST|Category C| 3| null|
|TEST|Category D| 4| 3|
+----+----------+---+-----+
Now, from the above data frame, I want to create some new data frames by changing the column values in certain columns. I did something like the following:

import pyspark.sql.functions as f
from functools import reduce
value_1 = 'TEST_1'
# change the B_ID values and the ID values; with the default
# spark.sql.caseSensitive=false, withColumn("id", ...) replaces the
# existing "ID" column and renames it to "id"
df1 = df.withColumn("B_ID", f.lit(value_1)).withColumn("id", f.lit(5))
df1.show()
+------+----------+---+-----+
| B_ID| Category| id|Value|
+------+----------+---+-----+
|TEST_1|Category A| 5| 1|
|TEST_1|Category B| 5| 2|
|TEST_1|Category C| 5| null|
|TEST_1|Category D| 5| 3|
+------+----------+---+-----+
value_2 = 'TESTING'
# change only the B_ID values; the second withColumn keeps the original
# ID values and just renames the column to "id" to match df1's schema
df2 = df.withColumn("B_ID", f.lit(value_2)).withColumn("id", f.col("ID"))
df2.show()
+-------+----------+---+-----+
| B_ID| Category| id|Value|
+-------+----------+---+-----+
|TESTING|Category A| 1| 1|
|TESTING|Category B| 2| 2|
|TESTING|Category C| 3| null|
|TESTING|Category D| 4| 3|
+-------+----------+---+-----+
# keep B_ID as-is and set a constant id for every row (an integer
# literal, so the id column stays numeric and the later union does
# not widen it to string)
df3 = df.withColumn("B_ID", f.col("B_ID")).withColumn("id", f.lit(6))
df3.show()
+----+----------+---+-----+
|B_ID| Category| id|Value|
+----+----------+---+-----+
|TEST|Category A| 6| 1|
|TEST|Category B| 6| 2|
|TEST|Category C| 6| null|
|TEST|Category D| 6| 3|
+----+----------+---+-----+
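As a side note, since the three frames differ only in the expressions used for B_ID and id, they could also be built in a loop rather than one by one. A minimal sketch (the names specs and variant_dfs are illustrative):

specs = [
    (f.lit('TEST_1'), f.lit(5)),      # constant B_ID, constant id
    (f.lit('TESTING'), f.col('ID')),  # constant B_ID, keep the original ID
    (f.col('B_ID'), f.lit(6)),        # keep B_ID, constant id
]
variant_dfs = [
    df.withColumn('B_ID', b_expr).withColumn('id', id_expr)
    for b_expr, id_expr in specs
]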
Now, after creating the data frames, I want to union all of the newly created data frames. I did something like the following:
from pyspark.sql import DataFrame
# list of data frames to union
list_df = [df1, df2, df3]
# union all the data frames; DataFrame.union matches columns by position
final_df = reduce(DataFrame.union, list_df)
final_df.show()
+-------+----------+---+-----+
| B_ID| Category| id|Value|
+-------+----------+---+-----+
| TEST_1|Category A| 5| 1|
| TEST_1|Category B| 5| 2|
| TEST_1|Category C| 5| null|
| TEST_1|Category D| 5| 3|
|TESTING|Category A| 1| 1|
|TESTING|Category B| 2| 2|
|TESTING|Category C| 3| null|
|TESTING|Category D| 4| 3|
| TEST|Category A| 6| 1|
| TEST|Category B| 6| 2|
| TEST|Category C| 6| null|
| TEST|Category D| 6| 3|
+-------+----------+---+-----+
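One caveat worth noting: DataFrame.union matches columns purely by position, not by name. If the frames could ever end up with their columns in a different order, unionByName (available since Spark 2.3) is the safer choice, assuming the column names match:

final_df = reduce(DataFrame.unionByName, list_df)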
This achieves what I want, but I would like to know whether there is a better way to get the same result.

Here is another approach that uses the inline SQL function, which explodes an array of structs into one row per struct, producing all three variants without any union:
# each parenthesized tuple in the SQL string is a struct literal;
# inline() turns the array of three structs into three rows per input row
df2 = df.selectExpr(
    'Category',
    'Value',
    "inline(array(('TEST_1' as B_ID, 5 as id), ('TESTING' as B_ID, id), (B_ID, 6 as id)))"
).select(df.columns)
df2.show()
+-------+----------+---+-----+
| B_ID| Category| ID|Value|
+-------+----------+---+-----+
| TEST_1|Category A| 5| 1|
|TESTING|Category A| 1| 1|
| TEST|Category A| 6| 1|
| TEST_1|Category B| 5| 2|
|TESTING|Category B| 2| 2|
| TEST|Category B| 6| 2|
| TEST_1|Category C| 5| null|
|TESTING|Category C| 3| null|
| TEST|Category C| 6| null|
| TEST_1|Category D| 5| 3|
|TESTING|Category D| 4| 3|
| TEST|Category D| 6| 3|
+-------+----------+---+-----+
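For those who prefer the DataFrame API over a SQL string, the same trick can be spelled with f.array, f.struct and f.inline. A sketch, assuming Spark 3.4+ (pyspark.sql.functions.inline only exists as a Python function from 3.4; on older versions keep the selectExpr form above); variants and df_inline are illustrative names:

# build the array<struct<B_ID, id>> column explicitly
variants = f.array(
    f.struct(f.lit('TEST_1').alias('B_ID'), f.lit(5).alias('id')),
    f.struct(f.lit('TESTING').alias('B_ID'), f.col('ID').alias('id')),
    f.struct(f.col('B_ID').alias('B_ID'), f.lit(6).alias('id')),
)
# inline() explodes the array of structs into one row per struct
df_inline = df.select('Category', 'Value', f.inline(variants)).select(df.columns)
df_inline.show()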