PySpark: StackOverflowError when merging rows into columns


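What I want (very simple): go from one row per name and typeT to one row per name, with one column per typeT value.

Some code I tried: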

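from pyspark.sql import functions as F, types as T
from pyspark.sql import Window

def add_columns(cur_typ, target, value):
    # Keep the value only on rows whose type matches the target column.
    if cur_typ == target:
        return value
    return None

schema = T.StructType([T.StructField("name", T.StringType(), True),
                       T.StructField("typeT", T.StringType(), True),
                       T.StructField("value", T.IntegerType(), True)])
data = [("x", "a", 3), ("x", "b", 5), ("x", "c", 7), ("y", "a", 1), ("y", "b", 2),
        ("y", "c", 4), ("z", "a", 6), ("z", "b", 2), ("z", "c", 3)]
# ctx is not defined in the post; presumably the asker's own object holding the SparkSession.
df = ctx.spark_session.createDataFrame(ctx.spark_session.sparkContext.parallelize(data), schema)
targets = [i.typeT for i in df.select("typeT").distinct().collect()]
# Declare the return type: F.udf defaults to StringType, which would make F.max compare strings.
add_columns = F.udf(add_columns, T.IntegerType())
w = Window.partitionBy("name")
for target in targets:
    # One withColumn call per distinct typeT value.
    df = df.withColumn(target, F.max(F.lit(add_columns(df["typeT"], F.lit(target), df["value"]))).over(w))
df = df.drop("typeT", "value").dropDuplicates()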

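Using withColumn in a loop is a bad idea when there are many columns to add. Each call introduces another projection into the query plan, and a long chain of them can produce a very large plan, poor performance, and even a StackOverflowError.

Instead, build a list of column expressions and select them all at once, which performs better: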

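cols = [F.col("name")]
for target in targets:
    # Build the expression only; nothing is added to the plan until select().
    cols.append(F.max(F.lit(add_columns(df["typeT"], F.lit(target), df["value"]))).over(w).alias(target))
# The window keeps one row per input row, so drop the duplicates afterwards.
df = df.select(cols).dropDuplicates()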

The result is the same output:

+----+---+---+---+
|name|  c|  b|  a|
+----+---+---+---+
|   x|  7|  5|  3|
|   z|  3|  2|  6|
|   y|  4|  2|  1|
+----+---+---+---+