Apache Spark PySpark dataframe - converting tuple data into rows

I want to convert the tuple data in a PySpark dataframe into rows, based on the two keys. The original data and the expected output are given below.

Schema:

    root
     |-- key_1: string (nullable = true)
     |-- key_2: string (nullable = true)
     |-- prod: string (nullable = true)

Original data:

key_1|key_2|prod
cust1|order1|(p1,p2,)
cust2|order2|(p1,p2,p3)
cust3|order3|(p1,)
Expected output:

key_1|key_2|prod|category
cust1|order1|p1
cust1|order1|p2
cust1|order1|
cust2|order2|p1
cust2|order2|p2
cust2|order2|p3
cust3|order3|p1
cust3|order3|

Spark has a function called explode that splits a list/array held in a single row into multiple rows, which is exactly what you want.
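For illustration, a minimal snippet (assuming a SparkSession named spark is already available; the demo column names are made up):

    # Illustrative only: explode turns each element of an array column into its own row.
    from pyspark.sql.functions import explode

    demo = spark.createDataFrame([('order1', ['p1', 'p2'])], ['key_2', 'prod_list'])
    demo.select('key_2', explode('prod_list')).show()
    # +------+---+
    # | key_2|col|
    # +------+---+
    # |order1| p1|
    # |order1| p2|
    # +------+---+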

But given your schema, we need one extra step first: converting the prod string column into an array type.

Sample code for the type conversion:

from pyspark.sql.functions import explode
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def squared(s):
    # udf function, convert string (p1,p2,p3) to array [p1, p2, p3]
    items = s[1:-2]  # Not sure it is correct with your data, please double check
    return items.split(',')

# Register udf
squared_udf = udf(squared, ArrayType(StringType()))

# Apply the udf to convert the prod string into a real array
df_2 = df.withColumn('prod_list', squared_udf('prod'))

# Explode prod_list
df_2.select(df.key_1, df.key_2, explode(df_2.prod_list)).show()

I have tested this and it works well:

+-----+------+---+
|key_1| key_2|col|
+-----+------+---+
|cust1|order1| p1|
|cust1|order1| p2|
|cust2|order2| p1|
|cust2|order2| p2|
|cust2|order2| p3|
|cust3|order3| p1|
+-----+------+---+
using this sample data:

    data = [
        {'key_1': 'cust1', 'key_2': 'order1', 'prod': '(p1,p2,)'},
        {'key_1': 'cust2', 'key_2': 'order2', 'prod': '(p1,p2,p3,)'},
        {'key_1': 'cust3', 'key_2': 'order3', 'prod': '(p1,)'},
    ]
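
In case it helps reproduce the test, a minimal sketch of building the df used above from that sample data (assuming a SparkSession named spark):

    # Sketch: build the test DataFrame from the sample dicts above
    # (assumes a SparkSession named `spark` exists).
    from pyspark.sql import Row

    df = spark.createDataFrame([Row(**d) for d in data])
    df.show(truncate=False)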

Please add the output of df.printSchema() to your question. I have edited my question and added the schema now. This works, but I would suggest avoiding UDFs wherever possible, both for performance reasons and to simplify the code (and your life). The UDF here can be replaced by a combination of Spark's built-in split and regexp_extract functions, for example as sketched below.
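
A sketch of what such a UDF-free version could look like; the regular expression used to strip the surrounding parentheses is an assumption, so verify it against the actual format of prod:

    # Sketch of a UDF-free variant using built-in functions only.
    # The regex '\((.*)\)' (capture everything between the parentheses) is an
    # assumption; adjust it to match how prod is really formatted.
    from pyspark.sql.functions import explode, split, regexp_extract

    df.select(
        df.key_1,
        df.key_2,
        explode(split(regexp_extract(df.prod, r'\((.*)\)', 1), ','))
    ).show()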