Apache Spark: DenseVector column to SparseVector column


I have a unique situation where I need to convert a DenseVector column to a SparseVector column.

I'm trying to implement the SMOTE technique I found here: but at line 44, due to an error, I had to change it from
min_array[neigh][0] - min_array[i][0]
to
DenseVector(min_array[neigh][0]) - DenseVector(min_array[i][0])
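
For context on why the wrapper helps: pyspark.ml.linalg.DenseVector supports elementwise arithmetic (it delegates to NumPy), while plain Python lists and SparseVectors do not support `-` directly. A minimal standalone sketch, not the SMOTE code itself:

from pyspark.ml.linalg import DenseVector

# DenseVector arithmetic is elementwise, backed by NumPy
a = DenseVector([1.0, 2.0, 3.0])
b = DenseVector([0.5, 1.0, 1.5])
print(a - b)  # [0.5,1.0,1.5]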

Once I have the DenseVector column, I need to convert it back to a SparseVector column so I can merge my data.

I tried the following:

df = sc.parallelize([
  (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
  (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
  (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])
"int() argument must be a string, a bytes-like object or a number, not 'DenseVector'"


"Data type struct of column features is not supported."
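
For reference, the first error is what Python raises when a DenseVector lands where an integer is expected; my guess at a minimal reproduction (not necessarily the asker's exact failing call) is passing the vector itself as the size argument of Vectors.sparse:

from pyspark.ml.linalg import Vectors, DenseVector

dv = DenseVector([0.0, 1.0])
# TypeError: int() argument must be a string, a bytes-like object or a
# number, not 'DenseVector' -- Vectors.sparse expects the size first
Vectors.sparse(dv)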

Converting a dense vector to a sparse vector usually doesn't make much sense, since the dense vector has already taken up the memory. If you really need to do it, look at the SparseVector API: it either accepts a list of (index, value) pairs, or you can pass the non-zero indices and values directly to the constructor. Like the following:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.linalg import DenseVector
from pyspark.sql.functions import udf

df = sc.parallelize([
  (1, DenseVector([0.0, 1.0, 1.0, 2.0, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0])),
  (2, DenseVector([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 100.0])),
  (3, DenseVector([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])),
]).toDF(["row_num", "features"])

def to_sparse(dense_vector):
    # Build (index, value) pairs for the non-zero entries only
    size = len(dense_vector)
    pairs = [(i, v) for i, v in enumerate(dense_vector.values.tolist()) if v != 0]
    return Vectors.sparse(size, pairs)

dense_to_sparse_udf = udf(to_sparse, VectorUDT())
df = df.withColumn('features', dense_to_sparse_udf(df["features"]))
df.show()

+-------+--------------------+
|row_num|            features|
+-------+--------------------+
|      1|(10,[1,2,3,4,5],[...|
|      2|    (10,[9],[100.0])|
|      3|      (10,[1],[1.0])|
+-------+--------------------+
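
As mentioned, the constructor accepts two equivalent forms; a quick illustration (my own, not part of the original answer):

from pyspark.ml.linalg import Vectors

# Form 1: size plus a list of (index, value) pairs
sv1 = Vectors.sparse(10, [(1, 1.0), (3, 5.5)])
# Form 2: size plus parallel lists of indices and values
sv2 = Vectors.sparse(10, [1, 3], [1.0, 5.5])
print(sv1 == sv2)  # True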

Thank you so much @Psidom. Shortly before you replied, I came across this post: , which is also another solution.
from pyspark.ml.feature import VectorAssembler

# Vectors.dense here (Vectors.sparse needs a size plus indices/values);
# VectorAssembler then emits a SparseVector whenever that form is smaller
list_to_vector_udf = udf(lambda l: Vectors.dense(l), VectorUDT())
df = df.withColumn('features', list_to_vector_udf(df["features"]))
assembler = VectorAssembler(inputCols=['features'], outputCol='sparse_features')
df = assembler.transform(df)
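
A note on the trade-off between the two approaches: the UDF makes the sparse conversion explicit, while VectorAssembler chooses between dense and sparse storage automatically, keeping whichever representation is more compact.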