Python PySpark: pad an Array[Int] column with zeros

I have the following column of type Array[Int] in a PySpark DataFrame:

+--------------------+
|     feature_indices|
+--------------------+
|                 [0]|
|[0, 1, 4, 10, 11,...|
|           [0, 1, 2]|
|                 [1]|
|                 [0]|
+--------------------+
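I am trying to pad each array with zeros and then cap its length, so that the array in every row has the same length. For example, for n=5, I expect: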

+--------------------+
|     feature_indices|
+--------------------+
|     [0, 0, 0, 0, 0]|
|   [0, 1, 4, 10, 11]|
|     [0, 1, 2, 0, 0]|
|     [1, 0, 0, 0, 0]|
|     [0, 0, 0, 0, 0]|
+--------------------+

Any suggestions? I looked at the PySpark rpad function, but it only operates on string-type columns.

You can write a udf to do this:

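from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as F

# Keep at most the first 5 elements, then right-pad with zeros up to length 5.
# The second argument declares the udf's return type as an array of integers.
pad_fix_length = F.udf(
    lambda arr: arr[:5] + [0] * (5 - len(arr[:5])),
    ArrayType(IntegerType())
)

df.withColumn('feature_indices', pad_fix_length(df.feature_indices)).show()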
+-----------------+
|  feature_indices|
+-----------------+
|  [0, 0, 0, 0, 0]|
|[0, 1, 4, 10, 11]|
|  [0, 1, 2, 0, 0]|
|  [1, 0, 0, 0, 0]|
|  [0, 0, 0, 0, 0]|
+-----------------+
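
The lambda above hard-codes n=5 and would raise an error on a null array. As a minimal sketch of the same truncate-then-pad logic in plain Python, the helper below takes the length as a parameter and treats None as an empty list. The name pad_to_length, the fill parameter, and the sample rows are assumptions made for illustration; they are not part of the original answer.

```python
from typing import List, Optional


def pad_to_length(arr: Optional[List[int]], n: int, fill: int = 0) -> List[int]:
    """Truncate arr to at most n items, then right-pad with fill up to length n."""
    # Treat a null array (None) as empty instead of failing on slicing.
    head = (arr or [])[:n]
    return head + [fill] * (n - len(head))


# Hypothetical sample rows, chosen only to exercise padding and truncation.
rows = [[0], [0, 1, 2], [1], [3, 4, 5, 6, 7, 8, 9], None]
padded = [pad_to_length(r, 5) for r in rows]
print(padded)
```

To use it on the DataFrame, the helper could be wrapped the same way as in the answer, for example F.udf(lambda arr: pad_to_length(arr, 5), ArrayType(IntegerType())).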

I recently used the pad_sequences function in Keras to do something similar. I am not sure about your use case, so this may be an unnecessarily large dependency.

Anyway, here is a link to the function's documentation:

Output:

[[1 2 3]
 [1 2 0]
 [1 4 0]]
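
The code that produced this output did not survive on this page. The sketch below is a plain-Python stand-in, not the Keras implementation: it right-pads every sequence with zeros to the length of the longest one, which is what a call such as pad_sequences(sequences, padding='post') would be expected to do. The function name pad_post and the input sequences are assumptions, picked so that the result matches the output shown above.

```python
from typing import List, Optional, Sequence


def pad_post(
    sequences: Sequence[Sequence[int]],
    maxlen: Optional[int] = None,
    value: int = 0,
) -> List[List[int]]:
    """Right-pad (and, if needed, cut the tail of) each sequence to maxlen."""
    # Default to the length of the longest sequence in the batch.
    if maxlen is None:
        maxlen = max((len(s) for s in sequences), default=0)
    result = []
    for s in sequences:
        head = list(s[:maxlen])
        result.append(head + [value] * (maxlen - len(head)))
    return result


# Assumed input; the original example's input is not shown on this page.
print(pad_post([[1, 2, 3], [1, 2], [1, 4]]))
```

Note that Keras itself pads and truncates at the front of each sequence by default (padding='pre', truncating='pre'), so matching the behaviour sketched here requires passing 'post' explicitly.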

Great, thanks! I had been struggling to write the udf correctly. What happens if we don't provide ArrayType(IntegerType()) in the udf?