How to pass an array column in pyspark and convert it to a numpy array


I have a dataframe like the one below:

from pyspark import SparkContext, SparkConf,SQLContext
import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql.functions import lit,countDistinct,udf,array,struct
import pyspark.sql.functions as F
config = SparkConf().setMaster("local")
sc = SparkContext(conf=config)
sqlContext=SQLContext(sc)

@udf("float")
def myfunction(x):
    y=np.array([1,3,9])
    x=np.array(x)
    return cosine(x,y)


df = sqlContext.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)]) \
       .withColumnRenamed("_1", "doc").withColumnRenamed("_2", "word1") \
       .withColumnRenamed("_3", "word2").withColumnRenamed("_4", "word3")


df2=df.select("doc", array([c for c in df.columns if c not in {'doc'}]).alias("words"))

df2=df2.withColumn("cosine",myfunction("words"))
This raises the following error:

19/10/02 21:24:58 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)

net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)


I am not sure why the list cannot be converted to a numpy array. Any help is much appreciated.

This is basically the same issue as in your question: you created a udf and told Spark that the function will return a float, but the object you return is of type numpy.float64.
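
To see the mismatch, here is a minimal sketch (assuming only that numpy and scipy are installed) that calls cosine directly and inspects the return type:

import numpy as np
from scipy.spatial.distance import cosine

d = cosine(np.array([9, 6, 0]), np.array([1, 3, 9]))
print(type(d))         # typically <class 'numpy.float64'>, not a plain Python float
print(type(d.item()))  # <class 'float'>, which is what the "float" udf declaration expects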

You can convert the numpy type to a Python type by calling .item(), as shown below:

import numpy as np
from scipy.spatial.distance import cosine
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, array

# a SparkSession is needed for spark.createDataFrame below
spark = SparkSession.builder.master("local").getOrCreate()


@udf("float")
def myfunction(x):
    # x arrives as a plain Python list for each row of the array column
    y = np.array([1, 3, 9])
    x = np.array(x)
    # .item() turns the numpy.float64 result into a plain Python float,
    # matching the declared "float" return type of the udf
    return cosine(x, y).item()


df = spark.createDataFrame([("doc_3", 1, 3, 9), ("doc_1", 9, 6, 0), ("doc_2", 9, 9, 3)]) \
       .withColumnRenamed("_1", "doc").withColumnRenamed("_2", "word1") \
       .withColumnRenamed("_3", "word2").withColumnRenamed("_4", "word3")


df2=df.select("doc", array([c for c in df.columns if c not in {'doc'}]).alias("words"))

df2=df2.withColumn("cosine",myfunction("words"))

df2.show(truncate=False)
Output:

+-----+---------+----------+ 
| doc |   words |   cosine | 
+-----+---------+----------+ 
|doc_3|[1, 3, 9]|      0.0 | 
|doc_1|[9, 6, 0]|0.7383323 | 
|doc_2|[9, 9, 3]|0.49496463| 
+-----+---------+----------+
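
As a sanity check (my own arithmetic, not part of the original answer): scipy's cosine is the cosine distance, 1 - x.y / (|x| * |y|), so recomputing doc_1 by hand reproduces the value in the table:

import numpy as np

x = np.array([9, 6, 0])  # doc_1's words
y = np.array([1, 3, 9])  # the vector hard-coded in the udf
print(1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))  # ~0.7383, matching the table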

I assume cosine returns a numpy array? If yes, please try cosine(x,y).item(). Please also include the relevant imports in your example code.

TypeError: expected string or Unicode object, NoneType found

Add the cosine import to your question and I will have a look.

@cronoik Is the conversion x=np.array(x) correct? I am converting a pyspark column to a numpy array.

Yes, that is correct. During execution, x is the single value of one particular row and column.
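
To illustrate that last comment, a minimal sketch with made-up row values (not taken from the thread): inside the udf, x is simply that row's entry of the words column, handed over as a plain Python list, and np.array(x) builds a one-dimensional numpy array from it:

import numpy as np

x = [9, 6, 0]            # what myfunction receives for the doc_1 row
x = np.array(x)          # the conversion asked about in the comments
print(x.dtype, x.shape)  # e.g. int64 (3,)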