Apache Spark, PySpark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?


I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark ml library) as follows:

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

where data is a Spark DataFrame with one column labeled features, which is a DenseVector of 3 dimensions:

data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))

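How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark; see the JIRA ticket and PCA & PCAModel for Spark ML 2.2. The original answer below is still applicable for older Spark versions.]

Well, it seems incredible, but indeed there is no way to extract such information from a PCA decomposition, at least as of Spark 1.5. Then again, there have been many similar "complaints"; see, for example, the question about not being able to extract the best parameters from a CrossValidatorModel.

Fortunately, some months ago I attended a MOOC by AMPLab (Berkeley) & Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline "by hand" as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-) so that they take DataFrames as input instead of RDDs, in the same format as yours, i.e. Rows of DenseVectors containing the numerical features.

We first need to define an intermediate function, estimateCovariance, as follows: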

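import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.
        Don't forget to normalize the data by first subtracting the mean.

    Args:
        df: A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal
            the length of the arrays in the input dataframe.
    """
    # .rdd keeps this working on Spark 2.x+, where DataFrame no longer exposes .map directly
    m = df.select(df['features']).rdd.map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).rdd.map(lambda x: x[0]).map(lambda x: x - m)  # subtract the mean

    return dfZeroMean.map(lambda x: np.outer(x, x)).sum() / df.count()

Then, we can write a main pca function as follows: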

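from numpy.linalg import eigh

def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvector as a column. This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
        scores, eigenvalues). Eigenvectors is a multi-dimensional array where the number of
        rows equals the length of the arrays in the input `RDD` and the number of columns equals
        `k`. The `RDD` of scores has the same number of rows as `data` and consists of arrays
        of length `k`. Eigenvalues is an array whose length is the number of features.
    """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  # eigenvectors as rows, largest eigenvalue first
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvalues
    score = df.select(df['features']).rdd.map(lambda x: x[0]).map(lambda x: np.dot(x, components.T))
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return components.T, score, eigVals

Test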

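Let's first see the results of the existing method, using the example data from the Spark ML PCA documentation (modified so that they are all DenseVectors):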

from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data, ["features"])
pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca_extracted.fit(df)
model.transform(df).collect()

[Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
 Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
 Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]

Then, with our method:

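comp, score, eigVals = pca(df)
score.collect()

[array([ 1.64857282,  4.0132827 ]),
 array([-4.64510433,  1.11679727]),
 array([-6.42888054,  5.33795143])]

Let me stress that we do not use any collect() method in the functions we have defined; score is an RDD, as it should be.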

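Notice that the signs of our second column are all opposite to the ones derived by the existing method. This is not an issue: according to the (freely downloadable) book co-authored by Hastie & Tibshirani, p. 382:

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. [...] Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of −Z.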
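The sign-flip point can be checked directly on the numbers printed in this thread. The snippet below is only a local NumPy sketch; the variable names are illustrative, and the two arrays are simply the second score column as reported by each method above.

```python
import numpy as np

# Second principal-component scores as printed in this thread.
z_spark = np.array([-4.0133, -1.1168, -5.338])              # existing Spark ML method
z_manual = np.array([4.0132827, 1.11679727, 5.33795143])    # hand-written pca()

# The two columns agree up to a global sign flip (to the printed precision).
same_up_to_sign = np.allclose(z_spark, -z_manual, atol=1e-3)

# Flipping the sign leaves the variance unchanged: Var(Z) == Var(-Z).
var_z = z_manual.var()
var_minus_z = (-z_manual).var()

print(same_up_to_sign, np.isclose(var_z, var_minus_z))
```

Both checks should come out true, which is why neither result is "more correct" than the other.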

Finally, now that we have the eigenvalues available, it is trivial to write a function for the percentage of the variance explained:

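def varianceExplained(df, k=1):
    """Calculate the fraction of variance explained by the top `k` eigenvectors.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k: The number of principal components to consider.

    Returns:
        float: A number between 0 and 1 representing the percentage of variance explained
            by the top `k` eigenvectors.
    """
    components, scores, eigenvalues = pca(df, k)
    return sum(eigenvalues[0:k]) / sum(eigenvalues)

varianceExplained(df, 1)
# 0.79439325322305299

As a test, we also check that the variance explained in our example data is 1.0 for k=5 (since the original data are 5-dimensional):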

varianceExplained(df, 5)
# 1.0

[Developed & tested with Spark 1.5.0 & 1.5.1]
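For readers without a cluster at hand, the explained-variance figure can be cross-checked locally. The following is a minimal NumPy sketch, not the Spark code itself: it rebuilds the covariance from outer products of the mean-centred example rows and takes the ratio of eigenvalues. The helper name `variance_explained` is made up for this illustration.

```python
import numpy as np

# Example rows used in this thread (3 samples, 5 features).
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

# Covariance as the mean of outer products of the mean-centred rows,
# mirroring what estimateCovariance computes on the cluster.
Xc = X - X.mean(axis=0)
cov = sum(np.outer(row, row) for row in Xc) / X.shape[0]

# eigh returns eigenvalues in ascending order; reverse them to get descending.
eig_vals = np.linalg.eigh(cov)[0][::-1]

def variance_explained(eig_vals, k):
    """Fraction of the total variance carried by the top-k eigenvalues."""
    return eig_vals[:k].sum() / eig_vals.sum()

print(round(variance_explained(eig_vals, 1), 4))  # 0.7944
```

Dividing by n or by n - 1 in the covariance does not matter here, because the ratio of eigenvalues is unaffected by a constant scale factor.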

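EDIT:

According to the resolved JIRA ticket, PCA and SVD are finally both available in pyspark starting with Spark 2.2.0.

Original answer:

The answer given by @desertnaut is actually excellent from a theoretical perspective, but I wanted to present another approach: how to compute the SVD and then extract the eigenvectors.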

from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix

class SVD(JavaModelWrapper):
    """Wrapper around the SVD scala case class"""
    @property
    def U(self):
        """ Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
        u = self.call("U")
        if u is not None:
            return RowMatrix(u)

    @property
    def s(self):
        """Returns a DenseVector with singular values in descending order."""
        return self.call("s")

    @property
    def V(self):
        """ Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
        return self.call("V")

This defines our SVD object. We can now define our computeSVD method using the Java wrapper.

def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    """
    Computes the singular value decomposition of the RowMatrix.
    The given row matrix A of dimension (m X n) is decomposed into U * s * V^T where
    * s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
    * U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
    * V: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
    :param k: number of singular values to keep. We might return fewer than k if there are numerically zero singular values.
    :param computeU: Whether or not to compute U. If set to True, then U is computed by A * V * sigma^-1
    :param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
    :returns: SVD object
    """
    java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
    return SVD(java_model)

Now, let's apply that to an example:

from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])

pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")

model = pca_extracted.fit(df)
features = model.transform(df) # this creates a DataFrame with the regular features and pca_features

# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)

# Once the RowMatrix is ready we can compute our Singular Value Decomposition
svd = computeSVD(mat,2,True)
svd.s
# DenseVector([9.491, 4.6253])
svd.U.rows.collect()
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
svd.V
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)

The easiest answer to your question is to feed an identity matrix into your model.

identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),(Vectors.dense([.0, 1.0, .0, .0, .0]),), \
              (Vectors.dense([.0, 0.0, 1.0, .0, .0]),),(Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
              (Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input,["features"])
identity_features = model.transform(df_identity)

This should give you the principal components.
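Why the identity trick works can be sketched locally with NumPy. This assumes the fitted model's transform is a plain projection onto the loading matrix with no re-centring, which is what the uncentred scores printed earlier in this thread suggest; the `transform` helper and the name `pc` are stand-ins for illustration, not Spark API.

```python
import numpy as np

# Example rows used in this thread (3 samples, 5 features).
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

# Loading matrix: top-2 eigenvectors of the covariance, stored as columns.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / X.shape[0]
eig_vals, eig_vecs = np.linalg.eigh(cov)   # ascending eigenvalue order
pc = eig_vecs[:, ::-1][:, :2]              # shape (5, 2)

def transform(rows, pc):
    """Assumed model behaviour: project rows onto pc without re-centring."""
    return rows @ pc

# Feeding the identity matrix returns the loading matrix itself:
# row i of the output is the weight of original feature i in each component.
identity_features = transform(np.eye(5), pc)
scores = transform(X, pc)

print(np.allclose(identity_features, pc))
print(np.round(np.abs(scores), 4))
```

The absolute values of `scores` should line up with the pca_features shown above, with signs possibly flipped per column.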

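I think eliasah's answer is better in terms of the Spark framework, because desertnaut solves the problem with numpy functions rather than Spark operations. However, eliasah's answer is missing the normalisation of the data. So I would add the following lines to eliasah's answer: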

from pyspark.ml.feature import StandardScaler
standardizer = StandardScaler(withMean=True, withStd=False,
                          inputCol='features',
                          outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)
pca_features = output.select("std_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat,5,True)

In the end, svd.V and identity_features.select("pca_features").collect() should have identical values.
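The link between the SVD of mean-centred data and the PCA eigen-decomposition can be verified on the example rows. This is a local NumPy sketch rather than the distributed computation: it centres the data (the same effect as withMean=True, withStd=False), then compares singular values and right singular vectors with the eigenvalues and eigenvectors of the covariance.

```python
import numpy as np

# Example rows used in this thread (3 samples, 5 features).
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])
n = X.shape[0]

# Mean-centre only, no scaling.
Xc = X - X.mean(axis=0)

# SVD of the centred data: Xc = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigen-decomposition of the covariance, sorted from largest to smallest.
eig_vals, eig_vecs = np.linalg.eigh(Xc.T @ Xc / n)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]

# Squared singular values, divided by n, are the covariance eigenvalues,
# so the explained-variance ratios follow from s alone.
explained = s ** 2 / np.sum(s ** 2)

print(np.allclose(s[:2] ** 2 / n, eig_vals[:2]))
print(np.round(explained[:2], 4))  # [0.7944 0.2056]
```

The columns of V (rows of `Vt`) match the covariance eigenvectors up to sign, so once the data are centred the singular values give the explained variance directly.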

I have summarised PCA and its use in Spark and sklearn in this article.

In Spark 2.2+ you can now easily get the explained variance as follows:

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
df = assembler.transform(<your original dataframe>).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
sum(model.explainedVariance)
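To see which component explains how much, rather than only the total, the per-component ratios can be listed with their running sum. The sketch below uses plain Python on a hard-coded list; the values are the rounded ratios for the 5-feature example in this thread, and in practice they would presumably come from something like `list(model.explainedVariance.toArray())`. The helper `components_needed` is hypothetical.

```python
from itertools import accumulate

# Per-component explained-variance ratios (assumed input, rounded values
# for the example data discussed in this thread).
explained = [0.7944, 0.2056]

# Running total: variance explained by the first 1, 2, ... components.
cumulative = list(accumulate(explained))

def components_needed(explained, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(explained), start=1):
        if total >= threshold:
            return k
    return len(explained)

for i, (ratio, total) in enumerate(zip(explained, cumulative), start=1):
    print("PC%d: %.4f (cumulative %.4f)" % (i, ratio, total))
```

Component i in this list corresponds to element i of the output vector column (pcaFeatures above), which is how the ratios map back to columns of the transformed data.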

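Have you thought about a PR?

@zero323 Yes, but it seems that there is already a PR for it, if I'm not mistaken.

@zero323 Take a look at this issue that I opened based on the question, and the related PR.

Thank you for not mentioning me in your paper! I believe that is the code from my answer.

I cited your code and gave the link in the comment. Besides, I don't know your name. If you want me to add another kind of acknowledgment, let me know. Also, this is not a paper. It is just a write-up I prepared with a friend to help people understand things.

Still, I'd rather be cited when my work is involved. I would do the same if I had used yours. It is part of the community collaboration rules and also of the StackOverflow license. You can also see my contact details in my SO profile. I'm usually very friendly :-)

Alright, I will update the write-up and re-share it. Thanks for the heads-up.

Sorry for the downvote. The question is more about how to identify the columns with their explained variance than about extracting the explained variance alone; this is not obvious from the question, but I'm pretty sure that is the purpose.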