Array type with mixed data in Spark (Python)

python apache-spark pyspark


I want to merge two different arrays of lists into one. Each array is a column in a Spark DataFrame, so I want to use a UDF.

def some_function(u,v):
  li = list()
  for x,y in zip(u,v):
      li.append(x.extend(y))
  return li

udf_object = udf(some_function,ArrayType(ArrayType(StringType())))
new_x = x.withColumn('new_name',udf_object(col('name'),col('features')))

This is the schema of the data:

root
 |-- blockingkey: string (nullable = true)
 |-- blocked_records: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- flattened_array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: float (containsNull = true)
 |-- name: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

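I am trying to merge name and features element by element, so that the first element of name is merged with the first element of features, and so on.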

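But whenever integer or float values are present, this returns an array of null values. Please help me solve this, either with a UDF or in some other way.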

If you have a DataFrame and schema as follows:

+------------------------------------------------+----------------------------------------+
|features                                        |name                                    |
+------------------------------------------------+----------------------------------------+
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|
+------------------------------------------------+----------------------------------------+

root
 |-- features: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)
 |-- name: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

then you can define a udf function and call it as follows:

import pyspark.sql.types as t
from pyspark.sql import functions as f

def some_function(u,v):
    li = []
    for x, y in zip(u, v):
        li.append(x + y)
    return li

udf_object = f.udf(some_function,t.ArrayType(t.ArrayType(t.StringType())))

new_x = x.withColumn('new_name',udf_object(f.col('name'),f.col('features')))

The resulting DataFrame and schema are:

+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
|features                                        |name                                    |new_name                                                    |
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|[WrappedArray(a, b, 2.0, 3.0), WrappedArray(c, d, 3.0, 5.0)]|
|[WrappedArray(2.0, 3.0), WrappedArray(3.0, 5.0)]|[WrappedArray(a, b), WrappedArray(c, d)]|[WrappedArray(a, b, 2.0, 3.0), WrappedArray(c, d, 3.0, 5.0)]|
+------------------------------------------------+----------------------------------------+------------------------------------------------------------+

root
 |-- features: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)
 |-- name: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- new_name: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
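
A note on why the original attempt seems to have produced nulls, as a minimal pure-Python sketch: `list.extend` modifies a list in place and returns `None`, so `li.append(x.extend(y))` appends `None` for every pair, whereas `x + y` builds and returns a new list. The sample values mirror the DataFrame above; the variable names `with_extend` and `with_concat` are illustrative only.

```python
# Sample values mirroring one row of the `name` and `features` columns.
names = [["a", "b"], ["c", "d"]]
features = [[2.0, 3.0], [3.0, 5.0]]

# list.extend mutates its receiver and returns None,
# so appending its return value collects only None entries.
with_extend = []
for x, y in zip(names, features):
    # list(x) makes a copy so that `names` itself stays unchanged here
    with_extend.append(list(x).extend(y))

# The + operator returns a new, concatenated list.
with_concat = []
for x, y in zip(names, features):
    with_concat.append(x + y)

print(with_extend)  # [None, None]
print(with_concat)  # [['a', 'b', 2.0, 3.0], ['c', 'd', 3.0, 5.0]]
```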

I hope the answer is helpful.

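You can't merge float and string values into one array; both must be of the same type.

If I append only x to li, it correctly returns just the names. But I want to extend the list with y.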
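
One way to address the same-type concern is to convert the float values to strings inside the function, so that every element already matches the declared `ArrayType(ArrayType(StringType()))`. The following is only a sketch under that assumption; the names `merge_as_strings` and `make_merge_udf` are hypothetical and do not come from the thread, and the `None` handling is a guess at what is wanted for null rows.

```python
def merge_as_strings(names, features):
    """Concatenate each inner list of names with the matching inner list
    of features, converting every feature value to a string first."""
    # Assumption: a null column value should yield a null result.
    if names is None or features is None:
        return None
    merged = []
    for name_row, feature_row in zip(names, features):
        merged.append(list(name_row) + [str(value) for value in feature_row])
    return merged


def make_merge_udf():
    """Wrap merge_as_strings in a Spark UDF (requires pyspark)."""
    # Imported lazily so the plain function above can be used without Spark.
    import pyspark.sql.types as t
    from pyspark.sql import functions as f

    return f.udf(merge_as_strings, t.ArrayType(t.ArrayType(t.StringType())))


print(merge_as_strings([["a", "b"], ["c", "d"]], [[2.0, 3.0], [3.0, 5.0]]))
# [['a', 'b', '2.0', '3.0'], ['c', 'd', '3.0', '5.0']]
```

With Spark available, the UDF could then be applied in the same way as in the answer, for example `x.withColumn('new_name', make_merge_udf()(f.col('name'), f.col('features')))`.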