
Python: error filling a PySpark DataFrame with a numpy array


Below is a sample of my Spark DataFrame, followed by its printSchema output:

+--------------------+---+------+------+--------------------+
|           device_id|age|gender| group|                apps|
+--------------------+---+------+------+--------------------+
|-9073325454084204615| 24|     M|M23-26|                null|
|-8965335561582270637| 28|     F|F27-28|[1.0,1.0,1.0,1.0,...|
|-8958861370644389191| 21|     M|  M22-|[4.0,0.0,0.0,0.0,...|
|-8956021912595401048| 21|     M|  M22-|                null|
|-8910497777165914301| 25|     F|F24-26|                null|
+--------------------+---+------+------+--------------------+
only showing top 5 rows

root
 |-- device_id: long (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- group: string (nullable = true)
 |-- apps: vector (nullable = true)
I am trying to fill the null values in the "apps" column with np.zeros(19237). But when I run

df.fillna({'apps': np.zeros(19237)})
I get the error

Py4JJavaError: An error occurred while calling o562.fill.
: java.lang.IllegalArgumentException: Unsupported value type java.util.ArrayList
Or, if I try

df.fillna({'apps': DenseVector(np.zeros(19237))})
I get the error

AttributeError: 'numpy.ndarray' object has no attribute '_get_object_id'

Any ideas?

DataFrameNaFunctions supports only a subset of native (non-UDT) types, so you need a UDF here:

from pyspark.sql.functions import coalesce, col, udf
from pyspark.ml.linalg import Vectors, VectorUDT

def zeros(n):
    # Returns a Column expression that produces a length-n all-zero sparse vector
    def zeros_():
        return Vectors.sparse(n, {})
    return udf(zeros_, VectorUDT())()
Example usage:

df = spark.createDataFrame(
    [(1, Vectors.dense([1, 2, 3])), (2, None)],
    ("device_id", "apps"))

df.withColumn("apps", coalesce(col("apps"), zeros(3))).show()
+---------+-------------+
|device_id|         apps|
+---------+-------------+
|        1|[1.0,2.0,3.0]|
|        2|    (3,[],[])|
+---------+-------------+