Modifying columns of a PySpark DataFrame
Tags: dataframe, pyspark, hive, pyspark-dataframes

I have the input DataFrame below, where the input columns are dynamic, i.e. there can be any number of them (input1, input2, and so on):
+----+----+-------+------+------+
|dim1|dim2|  byvar|input1|input2|
+----+----+-------+------+------+
| 101| 102|MTD0001|     1|    10|
| 101| 102|MTD0002|     2|    12|
| 101| 102|MTD0003|     3|    13|
+----+----+-------+------+------+
How can I transform it into the following?
+----+----+-------+----------+------+
|dim1|dim2|  byvar|TRAMS_NAME|values|
+----+----+-------+----------+------+
| 101| 102|MTD0001|    input1|     1|
| 101| 102|MTD0001|    input2|    10|
| 101| 102|MTD0002|    input1|     2|
| 101| 102|MTD0002|    input2|    12|
| 101| 102|MTD0003|    input1|     3|
| 101| 102|MTD0003|    input2|    13|
+----+----+-------+----------+------+
I used Spark's create_map method, but that is a hard-coded approach. Is there any other way to achieve the same result?
Sample DataFrame:
df.show() #added more columns to show code is dynamic
+----+----+-------+------+------+------+------+------+------+
|dim1|dim2|  byvar|input1|input2|input3|input4|input5|input6|
+----+----+-------+------+------+------+------+------+------+
| 101| 102|MTD0001|     1|    10|     3|     6|    10|    13|
| 101| 102|MTD0002|     2|    12|     4|     8|    11|    14|
| 101| 102|MTD0003|     3|    13|     5|     9|    12|    15|
+----+----+-------+------+------+------+------+------+------+
from pyspark.sql import functions as F

static_cols = ['dim1', 'dim2', 'byvar']
df.withColumn("vals",
    F.explode(F.arrays_zip(F.array(  # pair each dynamic column name with its value
        [F.array(F.lit(x), F.col(x)) for x in df.columns if x not in static_cols])))) \
  .select("dim1", "dim2", "byvar", "vals.*") \
  .withColumn("TRAMS_NAME", F.element_at("0", 1)) \
  .withColumn("VALUES", F.element_at("0", 2)).drop("0").show()
+----+----+-------+----------+------+
|dim1|dim2|  byvar|TRAMS_NAME|VALUES|
+----+----+-------+----------+------+
| 101| 102|MTD0001|    input1|     1|
| 101| 102|MTD0001|    input2|    10|
| 101| 102|MTD0001|    input3|     3|
| 101| 102|MTD0001|    input4|     6|
| 101| 102|MTD0001|    input5|    10|
| 101| 102|MTD0001|    input6|    13|
| 101| 102|MTD0002|    input1|     2|
| 101| 102|MTD0002|    input2|    12|
| 101| 102|MTD0002|    input3|     4|
| 101| 102|MTD0002|    input4|     8|
| 101| 102|MTD0002|    input5|    11|
| 101| 102|MTD0002|    input6|    14|
| 101| 102|MTD0003|    input1|     3|
| 101| 102|MTD0003|    input2|    13|
| 101| 102|MTD0003|    input3|     5|
| 101| 102|MTD0003|    input4|     9|
| 101| 102|MTD0003|    input5|    12|
| 101| 102|MTD0003|    input6|    15|
+----+----+-------+----------+------+
For Spark 2.4+, you can use explode, arrays_zip, array, and element_at to build the two columns dynamically. This works for any number of input columns, as long as the static columns (dim1, dim2, byvar) are excluded by name.
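For reference, the wide-to-long pairing that the explode/arrays_zip approach above performs can be sketched in plain Python, with no Spark runtime needed. The helper name unpivot_row is hypothetical, introduced here only for illustration:

```python
# Plain-Python sketch of the wide-to-long ("unpivot") step above.
# Each wide row (a dict) becomes one long row per non-static column.
ID_COLS = ("dim1", "dim2", "byvar")

def unpivot_row(row, id_cols=ID_COLS):
    """Expand one wide row into a list of (name, value) long rows."""
    ids = {k: row[k] for k in id_cols}
    return [
        {**ids, "TRAMS_NAME": name, "VALUES": value}
        for name, value in row.items()
        if name not in id_cols  # keep only the dynamic input* columns
    ]

wide = {"dim1": 101, "dim2": 102, "byvar": "MTD0001", "input1": 1, "input2": 10}
print(unpivot_row(wide))
```

Spark parallelizes the same per-row expansion across the cluster; the list comprehension filter plays the role of the `if x not in static_cols` condition in the PySpark snippet.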
Here is another way to solve it, using the stack() function. It is arguably a bit simpler, but it has the limitation that the column names must be listed explicitly. Hope this helps.
# set your dataframe
df = spark.createDataFrame(
    [(101, 102, 'MTD0001', 1, 10),
     (101, 102, 'MTD0002', 2, 12),
     (101, 102, 'MTD0003', 3, 13)],
    ['dim1', 'dim2', 'byvar', 'v1', 'v2']
)
df.show()
+----+----+-------+---+---+
|dim1|dim2|  byvar| v1| v2|
+----+----+-------+---+---+
| 101| 102|MTD0001|  1| 10|
| 101| 102|MTD0002|  2| 12|
| 101| 102|MTD0003|  3| 13|
+----+----+-------+---+---+
result = df.selectExpr('dim1',
                       'dim2',
                       'byvar',
                       "stack(2, 'v1', v1, 'v2', v2) as (names, values)")
result.show()
+----+----+-------+-----+------+
|dim1|dim2|  byvar|names|values|
+----+----+-------+-----+------+
| 101| 102|MTD0001|   v1|     1|
| 101| 102|MTD0001|   v2|    10|
| 101| 102|MTD0002|   v1|     2|
| 101| 102|MTD0002|   v2|    12|
| 101| 102|MTD0003|   v1|     3|
| 101| 102|MTD0003|   v2|    13|
+----+----+-------+-----+------+
If we want to choose the columns to stack dynamically, we only need to list the unchanged columns (in your example dim1, dim2, and byvar) and build the stack expression with a loop:
# set static columns
unaltered_cols = ['dim1', 'dim2', 'byvar']
# extract columns to stack
change_cols = [n for n in df.schema.names if not n in unaltered_cols]
cols_exp = ",".join(["'" + n + "'," + n for n in change_cols])
# create the stack expression
stack_exp = "stack(" + str(len(change_cols)) + ',' + cols_exp + ") as (names, values)"
# print final expression
print(stack_exp)
# --> stack(2,'v1',v1,'v2',v2) as (names, values)
# apply transformation
result = df.selectExpr('dim1',
                       'dim2',
                       'byvar',
                       stack_exp)
result.show()
+----+----+-------+-----+------+
|dim1|dim2|  byvar|names|values|
+----+----+-------+-----+------+
| 101| 102|MTD0001|   v1|     1|
| 101| 102|MTD0001|   v2|    10|
| 101| 102|MTD0002|   v1|     2|
| 101| 102|MTD0002|   v2|    12|
| 101| 102|MTD0003|   v1|     3|
| 101| 102|MTD0003|   v2|    13|
+----+----+-------+-----+------+
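The string-building loop above can also be wrapped in a small reusable function. The name build_stack_expr is hypothetical, a sketch assuming the same stack() SQL syntax shown in this answer (this variant adds spaces after the commas, which stack() accepts either way):

```python
def build_stack_expr(all_cols, id_cols, name_col="names", value_col="values"):
    """Build a Spark SQL stack() expression for every column not in id_cols."""
    stack_cols = [c for c in all_cols if c not in id_cols]
    pairs = ", ".join("'{0}', {0}".format(c) for c in stack_cols)
    return "stack({}, {}) as ({}, {})".format(
        len(stack_cols), pairs, name_col, value_col)

expr = build_stack_expr(['dim1', 'dim2', 'byvar', 'v1', 'v2'],
                        ['dim1', 'dim2', 'byvar'])
print(expr)  # stack(2, 'v1', v1, 'v2', v2) as (names, values)
```

The resulting string can then be passed straight to df.selectExpr('dim1', 'dim2', 'byvar', expr), so only the list of static columns ever needs to be maintained.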
If we run the same code with a different DataFrame, you get the desired result:
df = spark.createDataFrame(
    [(101, 102, 'MTD0001', 1, 10, 4),
     (101, 102, 'MTD0002', 2, 12, 5),
     (101, 102, 'MTD0003', 3, 13, 5)],
    ['dim1', 'dim2', 'byvar', 'v1', 'v2', 'v3']
)
# Re-run the code to create the stack_exp before!
result = df.selectExpr('dim1',
                       'dim2',
                       'byvar',
                       stack_exp)
result.show()
+----+----+-------+-----+------+
|dim1|dim2|  byvar|names|values|
+----+----+-------+-----+------+
| 101| 102|MTD0001|   v1|     1|
| 101| 102|MTD0001|   v2|    10|
| 101| 102|MTD0001|   v3|     4|
| 101| 102|MTD0002|   v1|     2|
| 101| 102|MTD0002|   v2|    12|
| 101| 102|MTD0002|   v3|     5|
| 101| 102|MTD0003|   v1|     3|
| 101| 102|MTD0003|   v2|    13|
| 101| 102|MTD0003|   v3|     5|
+----+----+-------+-----+------+
Comments:

- Why do the first two rows of TRAMS_NAME show input1 and input2 while the rest show value1 and value2? Shouldn't they all be one or the other?
- Yes, that was my mistake. I have updated the answer, please check. @Mohammad Murtaza Hashmi
- Thank you. But if there are many columns to stack, can we pass the names and values with a for loop?
- I added code to handle multiple columns!
- Thanks, but the columns are dynamic and the headers may also change. I will give it a try anyway.
- To handle that, just replace x.startswith('input') with x not in ['dim1', 'dim2', 'byvar']. I have edited the answer.