Pandas pyspark: create a k-means clustering model using spark.ml and a Spark DataFrame


I am creating the clustering model with the following code:

import pandas as pd
pandas_df = pd.read_pickle('df_features.pickle')
spark_df = sqlContext.createDataFrame(pandas_df)

from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=2, seed=1.0)
modela = kmeans.fit(spark_df)
Then I get this error:

AnalysisException                         Traceback (most recent call last)
<ipython-input-26-00e1e2ba1983> in <module>()
      3 
      4 kmeans = KMeans(k=2, seed=1.0)
----> 5 modela = kmeans.fit(spark_df)

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/base.pyc in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
---> 64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/wrapper.pyc in _fit(self, dataset)
    211 
    212     def _fit(self, dataset):
--> 213         java_model = self._fit_java(dataset)
    214         return self._create_model(java_model)
    215 

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/ml/wrapper.pyc in _fit_java(self, dataset)
    208         """
    209         self._transfer_params_to_java()
--> 210         return self._java_obj.fit(dataset._jdf)
    211 
    212     def _fit(self, dataset):

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    931         answer = self.gateway_client.send_command(command)
    932         return_value = get_return_value(
--> 933             answer, self.gateway_client, self.target_id, self.name)
    934 
    935         for temp_arg in temp_args:

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u"cannot resolve '`features`' given input columns: [field_1, field_2, field_3, field_4, field_5, field_6, field_7];"

Did I create the DataFrame in the wrong format? Does anyone know what I'm missing? Thanks!

For kmeans, it needs an RDD of DenseVectors. So you need to build an RDD of DenseVectors in which each vector corresponds to one row of your DataFrame. Assuming three columns of your DataFrame feed into the k-means model, I would refactor it along the following lines (this uses the RDD-based MLlib API):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

spark_rdd = spark_df.rdd                                               # RDD of Rows
modelInput = spark_rdd.map(lambda x: Vectors.dense(x[0], x[1], x[2]))  # one DenseVector per row
modelObject = KMeans.train(modelInput, 2)                              # RDD-based k-means, k=2
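As a quick sanity check on the fitted model (both members below are part of the MLlib KMeansModel API):

print(modelObject.clusterCenters)            # learned cluster centers
print(modelObject.computeCost(modelInput))   # within-set sum of squared errors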
Then, if you want to get the results from the RDD back into a DataFrame, I would do the following:

from pyspark.sql import Row

labels = modelInput.map(lambda x: modelObject.predict(x))   # cluster label per vector
results = labels.zip(spark_rdd)                             # (label, original Row) pairs
resultFrame = results.map(lambda x: Row(Label=x[0], Column1=x[1][0],
                                        Column2=x[1][1], Column3=x[1][2])).toDF()
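Note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition; that holds here because labels is derived from spark_rdd through plain map transformations.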



You need to use VectorAssembler to build the missing features column.
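The DataFrame-based KMeans in pyspark.ml looks for a vector column (named features by default), which is exactly the column the AnalysisException says it cannot resolve. A minimal sketch, assuming the seven field_* columns from the error message are all numeric:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Pack the seven numeric columns into a single 'features' vector column.
assembler = VectorAssembler(
    inputCols=['field_1', 'field_2', 'field_3', 'field_4',
               'field_5', 'field_6', 'field_7'],
    outputCol='features')
features_df = assembler.transform(spark_df)

kmeans = KMeans(k=2, seed=1.0)
modela = kmeans.fit(features_df)             # now resolves the 'features' column
clustered = modela.transform(features_df)    # adds a 'prediction' column

This keeps the original spark.ml code path instead of dropping down to the RDD API.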


Try kmeans.train()? Nothing unexpected here, as long as you know why this doesn't work.