Python: Creating a pandas UDF that builds a SparseVector

python pandas apache-spark pyspark

I am trying to define a pandas UDF that creates a SparseVector from a column of dictionaries. Here is an example:

from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import *
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

# create sample data
dff = spark.createDataFrame([Row(features=Row(indices=[1, 2], size=10, values=[11, 12])),
                             Row(features=Row(indices=[3, 4], size=10, values=[13, 14])),
                             Row(features=Row(indices=[5, 6, 7], size=10, values=[15, 16, 17]))
                             ])
dff.printSchema()

# access a value inside the struct
dff.withColumn('sparse', col('features')['size'])

I can access the individual key-value pairs in the features column, so I used rdd.map to create the SparseVector:

# creating the sparse vector with rdd.map works fine
dff.rdd.map(lambda x: SparseVector(x.features['size'],
                                   x.features['indices'],
                                   x.features['values'])).collect()

I would like to do the same without going through the RDD. I tried using .withColumn:

# trying using withColumn and SparseVector
dff.withColumn('sparse', SparseVector(col('features')['size'],
                                      col('features')['indices'],
                                      col('features')['values']))

but I get the error

TypeError: int() argument must be a string, a bytes-like object or a number, not 'Column'

I then tried defining a udf, as below:

# create sparse vector using column of dictionaries and udfs
#@udf
#def create_s_vector(x):
#    return SparseVector(x['size'],x['indices'],x['values'])

# not sure whats the proper returnType
@pandas_udf(VectorUDT(), PandasUDFType.SCALAR)
def create_s_vector(x_iter):
    for x in x_iter:
        yield SparseVector(x['size'],x['indices'],x['values'])

# try using udf
dff.withColumn('sparse', create_s_vector(col('features')))

With the code above, I get an error saying the return type is not supported. Thanks, everyone!

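Not sure whether this is just a misunderstanding, but when you use yield, the function type should be PandasUDFType.SCALAR_ITER. An iterator pandas UDF does not seem necessary, though, since you can simply initialize your create_s_vector UDF as a pandas UDF with create_s_vector_pdudf = pandas_udf(create_s_vector, VectorUDT()) and use it with withColumn('bla', create_s_vector_pdudf('features')).show()

@cronoik Apparently pandas_udf does not support the VectorUDT() return type.

Sorry, I had not tested that beforehand. I think you have to stick with a normal udf for now.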