Principal Component Analysis with PySpark in Python


I am using PySpark to run a PCA analysis, but I am getting an error caused by a type incompatibility in the data read from the CSV file. What should I do? Can you help me?

from __future__ import print_function
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors, VectorUDT

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
import pandas as pd
import numpy as np
from numpy import array


conf = SparkConf().setAppName("building a warehouse")
sc = SparkContext(conf=conf)

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()



    data = sc.textFile('dataset.csv') \
        .map(lambda line: line.split(',')) \
        .collect()
    # create a DataFrame from the data read from the CSV file
    df = spark.createDataFrame(data, ["features"])
    # convert the data to VectorUDT

    df.show()

    pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)

    result = model.transform(df).select("pcaFeatures")
    result.show(truncate=False)

    spark.stop()
Here is the error I get:

File "C:/spark/spark-2.1.0-bin-hadoop2.7/bin/pca_bigdata.py", line 38, in       <module>
model = pca.fit(df)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually StringType.'

The error itself says that the features column must be of type VectorUDT rather than StringType. So this will work for you:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructField, StructType

# the schema declares "features" as a VectorUDT column, so each row
# must already hold a pyspark.ml.linalg.Vector (not a string)
df = spark.createDataFrame(data, StructType([
    StructField("features", VectorUDT(), True)
]))
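
Note that the rows themselves also have to be ml Vectors before this schema can apply. A minimal sketch of the full read-and-convert step, assuming dataset.csv holds one comma-separated numeric record per line with no header:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructField, StructType

# parse each line into floats, then wrap the values in a DenseVector
# so every row matches the VectorUDT field declared in the schema
rows = sc.textFile('dataset.csv') \
    .map(lambda line: (Vectors.dense([float(x) for x in line.split(',')]),))

df = spark.createDataFrame(rows, StructType([
    StructField("features", VectorUDT(), True)
]))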


Can you provide an example of the file? Thanks. It contains data like: 1544717693328857458783894531251773371124267578,0,030145232421875127862388610839843011359537512652108001708984512086364746093755144682617187557190142822265625574243164062558660888671875571642933375
Your numbers are still being read as strings rather than floats; map them as follows:
data = sc.textFile('dataset.csv').map(lambda line: [float(k) for k in line.split(',')])
I tried your instructions, but the line containing model = pca.fit(df) still raises an error: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually Double type.'
@MehdiBenHamida you need to change the column type from StringType to VectorUDT.
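
For completeness, the conversion from raw CSV columns to a single vector column can also be done with VectorAssembler, which sidesteps building the schema by hand; a sketch assuming the same header-less, all-numeric dataset.csv:

from pyspark.ml.feature import PCA, VectorAssembler

# let Spark infer double-typed columns directly from the CSV,
# then assemble every column into a single "features" vector
raw = spark.read.csv('dataset.csv', inferSchema=True)
assembler = VectorAssembler(inputCols=raw.columns, outputCol="features")
df = assembler.transform(raw).select("features")

pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
model.transform(df).select("pcaFeatures").show(truncate=False)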