
Apache Spark: how to convert an RDD with a SparseVector column to a DataFrame with the column as Vector


I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame using the RDD, to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be done, because the ml library itself outputs a vector when given a DataFrame's features column:

temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
    StructField("label", DoubleType(), False),
    StructField("tokens", ArrayType(StringType()), False)
]))
# assuming there is an RDD of (double, array(string))

hashingTF = HashingTF(numFeatures=compositions, inputCol="tokens", outputCol="features")
ndf = hashingTF.transform(temp_df)
ndf.printSchema()

# outputs
# root
#  |-- label: double (nullable = false)
#  |-- tokens: array (nullable = false)
#  |    |-- element: string (containsNull = true)
#  |-- features: vector (nullable = true)
So my question is, can I get an RDD of (String, SparseVector) converted to a DataFrame of (String, vector)? I tried the usual sqlContext.createDataFrame, but there is no DataType that fits my need:

df = sqlContext.createDataFrame(rdd, StructType([
    StructField("label", StringType(), True),
    StructField("features", ?Type(), True)  # which type goes here?
]))

You have to use VectorUDT here:

# In Spark 1.x, use the mllib types instead:
# from pyspark.mllib.linalg import SparseVector, VectorUDT
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.types import StructType, StructField, DoubleType

temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

temp_rdd.toDF(schema).printSchema()

## root
##  |-- label: double (nullable = true)
##  |-- features: vector (nullable = true)
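
The same schema also drops straight into createDataFrame, which is what the question was attempting; a minimal sketch, assuming the sqlContext from the question and the temp_rdd and schema defined above:

sqlContext.createDataFrame(temp_rdd, schema).printSchema()

## root
##  |-- label: double (nullable = true)
##  |-- features: vector (nullable = true)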
Just for completeness, the Scala equivalent:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DoubleType, StructType}
// In Spark 1.x:
// import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType

val schema = new StructType()
  .add("label", DoubleType)
  // In Spark 1.x:
  // .add("features", new VectorUDT())
  .add("features", VectorType)

val temp_rdd: RDD[Row] = sc.parallelize(Seq(
  Row(0.0, Vectors.sparse(4, Seq((1, 1.0), (3, 5.5)))),
  Row(1.0, Vectors.sparse(4, Seq((0, -1.0), (2, 0.5))))
))

spark.createDataFrame(temp_rdd, schema).printSchema

// root
//  |-- label: double (nullable = true)
//  |-- features: vector (nullable = true)
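
Either way, the result has the (label: double, features: vector) shape that ml estimators expect; a quick PySpark check, a sketch assuming the temp_rdd and schema from the answer above:

from pyspark.ml.classification import LogisticRegression

df = temp_rdd.toDF(schema)
# fit() accepts exactly this shape: a double label column and a vector features column
model = LogisticRegression(maxIter=5).fit(df)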
While @zero323's answer makes sense, and I wish it had worked for me - the RDD underlying the DataFrame, sqlContext.createDataFrame(temp_rdd, schema), still contained SparseVector types. I had to do the following to convert to DenseVector types - if someone has a shorter/better way, I want to know:

from pyspark.ml.linalg import SparseVector, DenseVector, VectorUDT
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, DoubleType

temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

temp_rdd.toDF(schema).printSchema()
df_w_ftr = temp_rdd.toDF(schema)

print('original conversion method: ', df_w_ftr.take(5))
print('\n')
temp_rdd_dense = temp_rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
print(type(temp_rdd_dense), type(temp_rdd))
print('using map and toArray:', temp_rdd_dense.take(5))

temp_rdd_dense.toDF().show()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

original conversion method:  [Row(label=0.0, features=SparseVector(4, {1: 1.0, 3: 5.5})), Row(label=1.0, features=SparseVector(4, {0: -1.0, 2: 0.5}))]


<class 'pyspark.rdd.PipelinedRDD'> <class 'pyspark.rdd.RDD'>
using map and toArray: [Row(features=DenseVector([0.0, 1.0, 0.0, 5.5]), label=0.0), Row(features=DenseVector([-1.0, 0.0, 0.5, 0.0]), label=1.0)]

+------------------+-----+
|          features|label|
+------------------+-----+
| [0.0,1.0,0.0,5.5]|  0.0|
|[-1.0,0.0,0.5,0.0]|  1.0|
+------------------+-----+
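
For what it's worth, a shorter route is to densify the column in place with a UDF instead of round-tripping through the RDD - a sketch, assuming the Spark 2.x pyspark.ml.linalg types and the df_w_ftr DataFrame from above:

from pyspark.sql.functions import udf

# wrap each row's vector as a DenseVector, keeping the column's VectorUDT type
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
df_dense = df_w_ftr.withColumn("features", to_dense("features"))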

Here is an example for Spark 2.1 in Scala:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def featuresRDD2DataFrame(features: RDD[Vector]): DataFrame = {
  import sparkSession.implicits._
  val rdd: RDD[(Double, Vector)] = features.map(x => (0.0, x))
  val df = rdd.toDF("label", "features").select("features")
  df
}

Without the import sparkSession.implicits._ line, toDF() was not recognized by the compiler on the features RDD.

Wow, I've been searching for this for so long! Almost crying tears of joy :,) This worked! Thank you so much! Can you point me to where in the docs this is? I can't find any Vector docs on linalg apache spark.
@OrangelMarquez maybe a pull request is required.
I don't know about the docs, but the Spark source is a useful resource: