
Apache Spark: how to convert an RDD with a SparseVector column to a DataFrame with the column as Vector


I have an RDD with a tuple of values (String, SparseVector) and I want to create a DataFrame using the RDD, to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be done, because the ml library itself outputs a vector when given a DataFrame's features column:

temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
    StructField("label", DoubleType(), False),
    StructField("tokens", ArrayType(StringType()), False)
]))
# assuming there is an RDD of (double, array(string))

hashingTF = HashingTF(numFeatures=compositions, inputCol="tokens", outputCol="features")
ndf = hashingTF.transform(temp_df)
ndf.printSchema()

# outputs
# root
#  |-- label: double (nullable = false)
#  |-- tokens: array (nullable = false)
#  |    |-- element: string (containsNull = true)
#  |-- features: vector (nullable = true)
So my question is, can I get an RDD of (String, SparseVector) converted to a DataFrame of (String, vector)? I tried the usual sqlContext.createDataFrame, but there is no DataType that fits my need:

df = sqlContext.createDataFrame(rdd, StructType([
    StructField("label", StringType(), True),
    StructField("features", ?Type(), True)  # which type goes here?
]))

You have to use VectorUDT here:

# In Spark 1.x, use the mllib types instead:
# from pyspark.mllib.linalg import SparseVector, VectorUDT
from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.types import StructType, StructField, DoubleType

temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

temp_rdd.toDF(schema).printSchema()

## root
##  |-- label: double (nullable = true)
##  |-- features: vector (nullable = true)
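
The same schema also drops straight into createDataFrame, which is what the question was attempting; a minimal sketch, assuming the sqlContext from the question and the temp_rdd and schema defined above:

sqlContext.createDataFrame(temp_rdd, schema).printSchema()

## root
##  |-- label: double (nullable = true)
##  |-- features: vector (nullable = true)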
Just for completeness, the Scala equivalent:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{DoubleType, StructType}
// In Spark 1.x:
// import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType

val schema = new StructType()
  .add("label", DoubleType)
  // In Spark 1.x:
  // .add("features", new VectorUDT())
  .add("features", VectorType)

val temp_rdd: RDD[Row] = sc.parallelize(Seq(
  Row(0.0, Vectors.sparse(4, Seq((1, 1.0), (3, 5.5)))),
  Row(1.0, Vectors.sparse(4, Seq((0, -1.0), (2, 0.5))))
))

spark.createDataFrame(temp_rdd, schema).printSchema

// root
//  |-- label: double (nullable = true)
//  |-- features: vector (nullable = true)
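
Either way, the result has the (label: double, features: vector) shape that ml estimators expect; a quick PySpark check, a sketch assuming the temp_rdd and schema from the answer above:

from pyspark.ml.classification import LogisticRegression

df = temp_rdd.toDF(schema)
# fit() accepts exactly this shape: a double label column and a vector features column
model = LogisticRegression(maxIter=5).fit(df)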
While @zero323's answer makes sense, and I wish it had worked for me - the RDD underlying the DataFrame, sqlContext.createDataFrame(temp_rdd, schema), still contained SparseVector types. I had to do the following to convert to DenseVector types - if someone has a shorter/better way, I want to know:

from pyspark.ml.linalg import SparseVector, DenseVector, VectorUDT
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, DoubleType

temp_rdd = sc.parallelize([
    (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
    (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

schema = StructType([
    StructField("label", DoubleType(), True),
    StructField("features", VectorUDT(), True)
])

temp_rdd.toDF(schema).printSchema()
df_w_ftr = temp_rdd.toDF(schema)

print('original conversion method: ', df_w_ftr.take(5))
print('\n')
temp_rdd_dense = temp_rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
print(type(temp_rdd_dense), type(temp_rdd))
print('using map and toArray:', temp_rdd_dense.take(5))

temp_rdd_dense.toDF().show()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

original conversion method:  [Row(label=0.0, features=SparseVector(4, {1: 1.0, 3: 5.5})), Row(label=1.0, features=SparseVector(4, {0: -1.0, 2: 0.5}))]


<class 'pyspark.rdd.PipelinedRDD'> <class 'pyspark.rdd.RDD'>
using map and toArray: [Row(features=DenseVector([0.0, 1.0, 0.0, 5.5]), label=0.0), Row(features=DenseVector([-1.0, 0.0, 0.5, 0.0]), label=1.0)]

+------------------+-----+
|          features|label|
+------------------+-----+
| [0.0,1.0,0.0,5.5]|  0.0|
|[-1.0,0.0,0.5,0.0]|  1.0|
+------------------+-----+
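
For what it's worth, a shorter route is to densify the column in place with a UDF instead of round-tripping through the RDD - a sketch, assuming the Spark 2.x pyspark.ml.linalg types and the df_w_ftr DataFrame from above:

from pyspark.sql.functions import udf

# wrap each row's vector as a DenseVector, keeping the column's VectorUDT type
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
df_dense = df_w_ftr.withColumn("features", to_dense("features"))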

Here is an example for Spark 2.1 in Scala:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def featuresRDD2DataFrame(features: RDD[Vector]): DataFrame = {
  import sparkSession.implicits._
  val rdd: RDD[(Double, Vector)] = features.map(x => (0.0, x))
  val df = rdd.toDF("label", "features").select("features")
  df
}

Without the import sparkSession.implicits._ line, toDF() was not recognized by the compiler on the features RDD.

Wow, I've been searching for this for so long! Almost crying tears of joy :,) This worked! Thank you so much! Can you point me to where in the docs this is? I can't find any Vector docs on linalg apache spark.
@OrangelMarquez maybe a pull request is required.
I don't know about the docs, but the Spark source is a useful resource: