Apache Spark / PySpark: Series-to-Series UDF with an array of structs as input and an array of structs as output
How do I construct a UDF in Spark 3.0.1 that takes nested (struct) input values and produces nested output values?

Note: I know that older versions of Arrow had some limitations. That is why I force-install pyarrow>=2 via conda, where these issues were recently resolved. However, Spark does not seem to fully support it yet.

The schema looks like:
root
|-- meta_id: double (nullable = true)
|-- time: timestamp (nullable = true)
|-- category: string (nullable = true)
|-- value: long (nullable = true)
|-- metadata: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- time_start: timestamp (nullable = true)
| | |-- time_end: timestamp (nullable = true)
| | |-- value_a: long (nullable = true)
| | |-- value_b: double (nullable = true)
+-------+-------------------+--------+-----+------------------------------------------------------------+
|meta_id|time |category|value|metadata |
+-------+-------------------+--------+-----+------------------------------------------------------------+
|NaN |2020-01-01 04:00:00|1 |7 |null |
|1.0 |2020-01-01 00:00:00|1 |5 |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 6, 1.5619415E7]]|
|2.0 |2020-01-01 03:00:00|1 |8 |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 6, 1.5619415E7]]|
|5.0 |2020-01-06 00:00:00|1 |2 |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 7, 1.5619415E7]]|
+-------+-------------------+--------+-----+------------------------------------------------------------+
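For context, here is a minimal sketch that reconstructs this example DataFrame; the values are taken from the schema and sample rows above, while the DDL schema string and the meta() helper are assumptions for illustration:

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def meta(value_a):
    # One metadata entry: (time_start, time_end, value_a, value_b).
    return [(datetime.datetime(2020, 8, 12, 4, 29, 24),
             datetime.datetime(2020, 8, 12, 10, 22, 23),
             value_a, 1.5619415e7)]

df = spark.createDataFrame(
    [
        (float("nan"), datetime.datetime(2020, 1, 1, 4, 0, 0), "1", 7, None),
        (1.0, datetime.datetime(2020, 1, 1, 0, 0, 0), "1", 5, meta(6)),
        (2.0, datetime.datetime(2020, 1, 1, 3, 0, 0), "1", 8, meta(6)),
        (5.0, datetime.datetime(2020, 1, 6, 0, 0, 0), "1", 2, meta(7)),
    ],
    "meta_id double, time timestamp, category string, value long, "
    "metadata array<struct<time_start:timestamp,time_end:timestamp,value_a:long,value_b:double>>",
)
df.printSchema()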
import pandas as pd

def s2s(time: pd.Series, metadata: pd.DataFrame) -> pd.DataFrame:
    """We must use DataFrames to represent the structs."""
    # Iterate over all the time_start/time_end pairs and test for overlap
    # with the time column. The matching logic is omitted for brevity;
    # instead (to make debugging easier) we only loop over the metadata
    # and print its contents.
    print(metadata)
    if metadata is not None:
        for m in metadata:
            print(m)
            print('***')
    return pd.DataFrame({'overlap': False, 'overlap_value_a': -1, 'overlap_value_b': -1}, index=[0])
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StructType

s2s = pandas_udf(s2s, returnType=StructType())
df.select(s2s(col("time"), col("metadata"))).show()
This fails with:
0 None
Name: _1, dtype: object
None
***
ValueError: not enough values to unpack (expected 2, got 0)
But I am already checking for null values inside the function, so what is going wrong here?

It does work: you just need to use a pandas.DataFrame as the output and map the appropriate types for the resulting structs/arrays.
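For the struct-only case, a minimal sketch of what that looks like; the result schema "overlap boolean, overlap_value_a long, overlap_value_b double" is an assumed example, not taken from the question:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# The UDF returns a pd.DataFrame whose columns must match the declared
# struct fields by name and type, with one output row per input row.
@pandas_udf("overlap boolean, overlap_value_a long, overlap_value_b double")
def s2s_struct(time: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        "overlap": [False] * len(time),
        "overlap_value_a": [-1] * len(time),
        "overlap_value_b": [-1.0] * len(time),
    })

df.select(s2s_struct(col("time")).alias("result")).show(truncate=False)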
pd.DataFrame maps to a StructType, but the metadata column is an array of StructType, and I do not think the current pandas UDFs support that. You could probably convert the data from an array of structs to an array of strings instead, e.g.:

metadata = metadata.groupBy("meta_id").agg(
    collect_set(concat_ws(",", "time_start", "time_end", "value_a", "value_b")).alias("metadata")
)

and then split each string back into the 4 fields in pandas. Note that if any of the 4 columns can be null, you need to apply e.g. coalesce("time_start") first, so that the fields still align correctly after the split.
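An end-to-end sketch of that workaround, under the assumption that the structs are built from a flat metadata DataFrame (columns meta_id, time_start, time_end, value_a, value_b) before being collected; the empty-string sentinel and the has_entries UDF are illustrative only:

import pandas as pd
from pyspark.sql.functions import coalesce, col, collect_set, concat_ws, lit, pandas_udf

# Encode each metadata row as a single delimited string. concat_ws silently
# drops null arguments, so coalesce each field to a sentinel first to keep
# the 4 positions aligned after splitting.
encoded = (
    metadata
    .select(
        "meta_id",
        concat_ws(
            ",",
            coalesce(col("time_start").cast("string"), lit("")),
            coalesce(col("time_end").cast("string"), lit("")),
            coalesce(col("value_a").cast("string"), lit("")),
            coalesce(col("value_b").cast("string"), lit("")),
        ).alias("entry"),
    )
    .groupBy("meta_id")
    .agg(collect_set("entry").alias("metadata"))
)

# Arrays of strings are supported by pandas UDFs, so the strings can be
# split back into the 4 fields on the pandas side.
@pandas_udf("boolean")
def has_entries(metadata: pd.Series) -> pd.Series:
    def check(entries):
        if entries is None:
            return False
        for entry in entries:
            time_start, time_end, value_a, value_b = entry.split(",")
            # ... the actual overlap/matching logic would go here ...
        return len(entries) > 0
    return metadata.apply(check)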
I hoped that using Arrow 2.x would resolve this, but you may be right that Spark itself does not yet take full advantage of these new features. Does it work with a regular UDF, i.e. without Arrow?

Yes, in a regular UDF a StructType converts to a Row object, a MapType converts to a dict, and an ArrayType converts to a list. Combinations of nested data types should work as well.
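A minimal sketch of that regular (non-Arrow) UDF route, where each array-of-structs value arrives as a list of Row objects; the name s2s_plain and the result fields are illustrative:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType, DoubleType, LongType, StructField, StructType

result_type = StructType([
    StructField("overlap", BooleanType()),
    StructField("overlap_value_a", LongType()),
    StructField("overlap_value_b", DoubleType()),
])

@udf(returnType=result_type)
def s2s_plain(time, metadata):
    # metadata arrives as a list of Row objects (or None); struct fields
    # are read by attribute, e.g. m.time_start.
    if metadata is not None:
        for m in metadata:
            if m.time_start <= time <= m.time_end:
                return (True, m.value_a, m.value_b)
    return (False, -1, -1.0)

df.select(s2s_plain(col("time"), col("metadata")).alias("result")).show(truncate=False)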