Apache Spark / PySpark: Series-to-Series UDF with an array of structs as input and an array of structs as output
How do I construct a UDF in Spark 3.0.1 that takes nested (struct) input values and produces nested output values?

Note: I know that older versions of Arrow had some limitations. That is why I force-install pyarrow>=2 via conda, where these issues were recently resolved. However, Spark does not seem to fully support it yet.

The schema looks like:
root
|-- meta_id: double (nullable = true)
|-- time: timestamp (nullable = true)
|-- category: string (nullable = true)
|-- value: long (nullable = true)
|-- metadata: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- time_start: timestamp (nullable = true)
| | |-- time_end: timestamp (nullable = true)
| | |-- value_a: long (nullable = true)
| | |-- value_b: double (nullable = true)
+-------+-------------------+--------+-----+------------------------------------------------------------+
|meta_id|time |category|value|metadata |
+-------+-------------------+--------+-----+------------------------------------------------------------+
|NaN |2020-01-01 04:00:00|1 |7 |null |
|1.0 |2020-01-01 00:00:00|1 |5 |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 6, 1.5619415E7]]|
|2.0 |2020-01-01 03:00:00|1 |8 |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 6, 1.5619415E7]]|
|5.0 |2020-01-06 00:00:00|1 |2 |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 7, 1.5619415E7]]|
+-------+-------------------+--------+-----+------------------------------------------------------------+
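For context, here is a minimal sketch that reconstructs this example DataFrame; the values are taken from the schema and sample rows above, while the DDL schema string and the meta() helper are assumptions for illustration:

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def meta(value_a):
    # One metadata entry: (time_start, time_end, value_a, value_b).
    return [(datetime.datetime(2020, 8, 12, 4, 29, 24),
             datetime.datetime(2020, 8, 12, 10, 22, 23),
             value_a, 1.5619415e7)]

df = spark.createDataFrame(
    [
        (float("nan"), datetime.datetime(2020, 1, 1, 4, 0, 0), "1", 7, None),
        (1.0, datetime.datetime(2020, 1, 1, 0, 0, 0), "1", 5, meta(6)),
        (2.0, datetime.datetime(2020, 1, 1, 3, 0, 0), "1", 8, meta(6)),
        (5.0, datetime.datetime(2020, 1, 6, 0, 0, 0), "1", 2, meta(7)),
    ],
    "meta_id double, time timestamp, category string, value long, "
    "metadata array<struct<time_start:timestamp,time_end:timestamp,value_a:long,value_b:double>>",
)
df.printSchema()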
import pandas as pd

def s2s(time: pd.Series, metadata: pd.DataFrame) -> pd.DataFrame:
    """We must use DataFrames to represent the structs."""
    # Iterate over all the time_start/time_end pairs and test for overlap
    # with the time column. The matching logic is omitted for brevity;
    # instead (to make debugging easier) we only loop over the metadata
    # and print its contents.
    print(metadata)
    if metadata is not None:
        for m in metadata:
            print(m)
            print('***')
    return pd.DataFrame({'overlap': False, 'overlap_value_a': -1, 'overlap_value_b': -1}, index=[0])
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StructType

s2s = pandas_udf(s2s, returnType=StructType())
df.select(s2s(col("time"), col("metadata"))).show()
This fails with:
0 None
Name: _1, dtype: object
None
***
ValueError: not enough values to unpack (expected 2, got 0)
But I am already checking for null values inside the function, so what is going wrong here?

It does work: you just need to use a pandas.DataFrame as the output and map the appropriate types for the resulting structs/arrays.
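For the struct-only case, a minimal sketch of what that looks like; the result schema "overlap boolean, overlap_value_a long, overlap_value_b double" is an assumed example, not taken from the question:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf

# The UDF returns a pd.DataFrame whose columns must match the declared
# struct fields by name and type, with one output row per input row.
@pandas_udf("overlap boolean, overlap_value_a long, overlap_value_b double")
def s2s_struct(time: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        "overlap": [False] * len(time),
        "overlap_value_a": [-1] * len(time),
        "overlap_value_b": [-1.0] * len(time),
    })

df.select(s2s_struct(col("time")).alias("result")).show(truncate=False)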
pd.DataFrame maps to a StructType, but the metadata column is an array of StructType, and I do not think the current pandas UDFs support that. You could probably convert the data from an array of structs to an array of strings instead, e.g.:

metadata = metadata.groupBy("meta_id").agg(
    collect_set(concat_ws(",", "time_start", "time_end", "value_a", "value_b")).alias("metadata")
)

and then split each string back into the 4 fields in pandas. Note that if any of the 4 columns can be null, you need to apply e.g. coalesce("time_start") first, so that the fields still align correctly after the split.
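An end-to-end sketch of that workaround, under the assumption that the structs are built from a flat metadata DataFrame (columns meta_id, time_start, time_end, value_a, value_b) before being collected; the empty-string sentinel and the has_entries UDF are illustrative only:

import pandas as pd
from pyspark.sql.functions import coalesce, col, collect_set, concat_ws, lit, pandas_udf

# Encode each metadata row as a single delimited string. concat_ws silently
# drops null arguments, so coalesce each field to a sentinel first to keep
# the 4 positions aligned after splitting.
encoded = (
    metadata
    .select(
        "meta_id",
        concat_ws(
            ",",
            coalesce(col("time_start").cast("string"), lit("")),
            coalesce(col("time_end").cast("string"), lit("")),
            coalesce(col("value_a").cast("string"), lit("")),
            coalesce(col("value_b").cast("string"), lit("")),
        ).alias("entry"),
    )
    .groupBy("meta_id")
    .agg(collect_set("entry").alias("metadata"))
)

# Arrays of strings are supported by pandas UDFs, so the strings can be
# split back into the 4 fields on the pandas side.
@pandas_udf("boolean")
def has_entries(metadata: pd.Series) -> pd.Series:
    def check(entries):
        if entries is None:
            return False
        for entry in entries:
            time_start, time_end, value_a, value_b = entry.split(",")
            # ... the actual overlap/matching logic would go here ...
        return len(entries) > 0
    return metadata.apply(check)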
I hoped that using Arrow 2.x would resolve this, but you may be right that Spark itself does not yet take full advantage of these new features. Does it work with a regular UDF, i.e. without Arrow?

Yes, in a regular UDF a StructType converts to a Row object, a MapType converts to a dict, and an ArrayType converts to a list. Combinations of nested data types should work as well.
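A minimal sketch of that regular (non-Arrow) UDF route, where each array-of-structs value arrives as a list of Row objects; the name s2s_plain and the result fields are illustrative:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType, DoubleType, LongType, StructField, StructType

result_type = StructType([
    StructField("overlap", BooleanType()),
    StructField("overlap_value_a", LongType()),
    StructField("overlap_value_b", DoubleType()),
])

@udf(returnType=result_type)
def s2s_plain(time, metadata):
    # metadata arrives as a list of Row objects (or None); struct fields
    # are read by attribute, e.g. m.time_start.
    if metadata is not None:
        for m in metadata:
            if m.time_start <= time <= m.time_end:
                return (True, m.value_a, m.value_b)
    return (False, -1, -1.0)

df.select(s2s_plain(col("time"), col("metadata")).alias("result")).show(truncate=False)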