Python: vectorizing Spark built-in functions

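Context:

I have a DataFrame of sparse vectors of size 1000, and I need to apply the natural logarithm to it. I would like to use the built-in function that Spark already provides, but it does not work on vectors. Here is its documentation: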

from pyspark.sql.functions import log1p

Signature: log1p(col)
Docstring:
Computes the natural logarithm of the given value plus one.

.. versionadded:: 1.4
File:      /usr/spark-2.4.4/python/pyspark/sql/functions.py
Type:      function

Is there anything in Spark that can do something like this?

from pyspark.sql.functions import log1p, col, udf
import numpy as np
import math

log = np.vectorize(math.log1p)

df.withColumn('log_features', log(col('features')))

I am currently using a UDF, but it is not very efficient:

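from pyspark.ml.linalg import VectorUDT  # assuming the ML (not MLlib) vector type

@udf(returnType=VectorUDT())
def UDF_calculate_log_features(vector):
    # Apply log1p only to the stored (non-zero) values of the sparse vector
    vector.values = [math.log1p(x) for x in vector.values]
    return vector

df.withColumn('features_log', UDF_calculate_log_features(col('features')))

Here is the DataFrame: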

df.select('features').show(5, 72)  # approx. 176M rows

## +------------------------------------------------------------------------+
## |                                                                features|
## +------------------------------------------------------------------------+
## |                                  (10000,[1677,2549,3891],[1.0,1.0,1.0])|
## |                                   (10000,[714,2212,2812],[1.0,1.0,1.0])|
## |                                  (10000,[2146,2815,7826],[1.0,1.0,1.0])|
## |(10000,[2049,5377,6014,7239,8395,8848,9318,9367],[1.0,1.0,1.0,1.0,1.0...|
## |       (10000,[4694,6597,7595,8545,8869,9187],[2.0,1.0,1.0,1.0,1.0,1.0])|
## +------------------------------------------------------------------------+
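
For reference, the UDF above only touches the stored values of each sparse vector. That is safe for log1p specifically, because log1p(0) = 0, so the implicit zeros stay zero and the result is still sparse. The following is a minimal sketch of that property in plain Python, without Spark. The helper names `log1p_sparse` and `to_dense` are hypothetical, and a sparse vector is assumed to be represented as a `(size, indices, values)` triple mirroring the output shown above.

```python
import math


def log1p_sparse(size, indices, values):
    """Apply log1p to a sparse vector given as (size, indices, values).

    Hypothetical helper: only the stored values are transformed, because
    log1p(0) == 0 leaves every implicit zero unchanged.
    """
    return size, list(indices), [math.log1p(v) for v in values]


def to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Last row of the sample output shown above.
size = 10000
indices = [4694, 6597, 7595, 8545, 8869, 9187]
values = [2.0, 1.0, 1.0, 1.0, 1.0, 1.0]

sparse_result = log1p_sparse(size, indices, values)

# Transforming the dense form element by element gives the same vector,
# so working on the stored values alone loses nothing.
dense_result = [math.log1p(x) for x in to_dense(size, indices, values)]
assert to_dense(*sparse_result) == dense_result
```

The same shortcut would not hold for a function such as `log`, where the implicit zeros would no longer map to zero and the result would stop being sparse.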