PySpark Series-to-Series pandas UDF with an array of structs as input and an array of structs as output

Tags: apache-spark, pyspark, apache-spark-sql, user-defined-functions

How can I construct a UDF in Spark 3.0.1 that takes nested (struct) values as input and output?

Note: I know that older versions of Arrow had some limitations here. That is why I force-install pyarrow>=2 via conda, where this was recently resolved. However, Spark does not yet seem to be aware of (fully support) it.
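For context, a minimal sketch of the setup being assumed here; the Arrow toggle is the standard Spark 3.x configuration key, while the session name is purely illustrative:

import pyarrow
from pyspark.sql import SparkSession

# confirm which pyarrow version the Python workers will pick up (expecting >= 2.0)
print(pyarrow.__version__)

spark = (
    SparkSession.builder
    .appName("struct-array-udf")  # illustrative name
    # standard switch for Arrow-based conversion and pandas UDFs in Spark 3.x
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)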

The DataFrame looks like:

root
 |-- meta_id: double (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- category: string (nullable = true)
 |-- value: long (nullable = true)
 |-- metadata: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- time_start: timestamp (nullable = true)
 |    |    |-- time_end: timestamp (nullable = true)
 |    |    |-- value_a: long (nullable = true)
 |    |    |-- value_b: double (nullable = true)


+-------+-------------------+--------+-----+------------------------------------------------------------+
|meta_id|time               |category|value|metadata                                                    |
+-------+-------------------+--------+-----+------------------------------------------------------------+
|NaN    |2020-01-01 04:00:00|1       |7    |null                                                        |
|1.0    |2020-01-01 00:00:00|1       |5    |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 6, 1.5619415E7]]|
|2.0    |2020-01-01 03:00:00|1       |8    |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 6, 1.5619415E7]]|
|5.0    |2020-01-06 00:00:00|1       |2    |[[2020-08-12 04:29:24, 2020-08-12 10:22:23, 7, 1.5619415E7]]|
+-------+-------------------+--------+-----+------------------------------------------------------------+
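For reproducibility, a sketch of how a small DataFrame with this schema could be constructed; the values mirror the sample rows above, and the DDL schema string is an assumption about the exact column types:

import datetime
from pyspark.sql import Row

meta = Row("time_start", "time_end", "value_a", "value_b")
m1 = meta(datetime.datetime(2020, 8, 12, 4, 29, 24),
          datetime.datetime(2020, 8, 12, 10, 22, 23), 6, 1.5619415e7)

df = spark.createDataFrame(
    [
        (float("nan"), datetime.datetime(2020, 1, 1, 4), "1", 7, None),
        (1.0, datetime.datetime(2020, 1, 1, 0), "1", 5, [m1]),
        (2.0, datetime.datetime(2020, 1, 1, 3), "1", 8, [m1]),
        (5.0, datetime.datetime(2020, 1, 6, 0), "1", 2,
         [meta(datetime.datetime(2020, 8, 12, 4, 29, 24),
               datetime.datetime(2020, 8, 12, 10, 22, 23), 7, 1.5619415e7)]),
    ],
    "meta_id double, time timestamp, category string, value long, "
    "metadata array<struct<time_start:timestamp,time_end:timestamp,value_a:long,value_b:double>>",
)
df.printSchema()
df.show(truncate=False)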

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StructType

def s2s(time: pd.Series, metadata: pd.DataFrame) -> pd.DataFrame:
    """We must use DataFrames to represent the structs."""
    # Iterate over all the time_start/time_end pairs and test for overlap with
    # the time column. The matching logic is not implemented for the sake of
    # brevity; instead (and to make debugging easier) only a loop that prints
    # the contents of metadata.
    print(metadata)
    if metadata is not None:
        for m in metadata:
            print(m)
    print('***')
    return pd.DataFrame(
        {'overlap': False, 'overlap_value_a': -1, 'overlap_value_b': -1},
        index=[0],
    )

s2s = pandas_udf(s2s, returnType=StructType())

df.select(s2s(col("time"), col("metadata"))).show()
This fails with:

0    None
Name: _1, dtype: object
None
***
ValueError: not enough values to unpack (expected 2, got 0)

But I am already checking for None inside the UDF - so what is going wrong here?

It works - you just need to use a pandas.DataFrame as the output and map the appropriate types for the resulting struct/array.
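A minimal sketch of what that looks like, assuming the UDF only receives the time column (the array-of-structs input is the part discussed in the comments below) and reusing the field names from the question's s2s function:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StructType, StructField, BooleanType, LongType

# the returned struct has to be declared field by field; a pandas.DataFrame
# with matching column names is mapped onto it
result_schema = StructType([
    StructField("overlap", BooleanType()),
    StructField("overlap_value_a", LongType()),
    StructField("overlap_value_b", LongType()),
])

@pandas_udf(result_schema)
def s2s(time: pd.Series) -> pd.DataFrame:
    # one output row per input row, columns named exactly as in the schema
    return pd.DataFrame({
        "overlap": [False] * len(time),
        "overlap_value_a": [-1] * len(time),
        "overlap_value_b": [-1] * len(time),
    })

df.select(s2s(col("time")).alias("res")).select("res.*").show()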

pd.DataFrame maps to StructType, but the metadata column is an array of StructType, and I don't think the current pandas UDFs support that. You could convert the column from an array of structs to an array of strings, e.g. metadata = metadata.groupBy("meta_id").agg(collect_set(concat_ws(",", "time_start", "time_end", "value_a", "value_b")).alias("metadata")), and then split them back into 4 fields inside pandas. Also, if any of the 4 columns can be null you will need something like coalesce("time_start") so that the fields still line up correctly after splitting.

I was hoping Arrow 2.x would fix this - but you may be right that Spark itself is not yet fully aware of these new capabilities. Does it work with a regular UDF, i.e. without Arrow?

Yes, in a regular UDF a StructType is converted to a Row object, a MapType to a dict, and an ArrayType to a list. Combinations of nested data types should also work.
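A sketch of that workaround, assuming the conversion is done directly on df with the transform higher-order function (rather than grouping a separate metadata DataFrame as in the comment) and returning a plain boolean for simplicity:

from pyspark.sql import functions as F
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType

# flatten each struct in the array into one comma-separated string, coalescing
# nulls to '' so that the 4 fields stay aligned after splitting
df_str = df.withColumn(
    "metadata_str",
    F.expr(
        "transform(metadata, m -> concat_ws(',', "
        "coalesce(cast(m.time_start as string), ''), "
        "coalesce(cast(m.time_end as string), ''), "
        "coalesce(cast(m.value_a as string), ''), "
        "coalesce(cast(m.value_b as string), '')))"
    ),
)

@pandas_udf(BooleanType())
def has_overlap(time: pd.Series, metadata_str: pd.Series) -> pd.Series:
    out = []
    for t, items in zip(time, metadata_str):
        found = False
        if items is not None:
            for item in items:
                # split back into the 4 fields produced by concat_ws above
                time_start, time_end, value_a, value_b = item.split(",")
                # the actual overlap test against t is elided, as in the question
        out.append(found)
    return pd.Series(out)

df_str.select(has_overlap("time", "metadata_str")).show()

Alternatively, as noted above, a regular (non-Arrow) UDF receives the array of structs as a list of Row objects, so the original nested input can be kept at the cost of the pandas/Arrow speedup.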