C#: How to correctly use SVD in Accord.net


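SVD stands for Singular Value Decomposition, and it is regarded as a common technique for feature reduction in text classification. I know the principle.

I have been using C# with the Accord.Net library, and I already have a jagged array double[][] from computing TF-IDF.

I already know that my documents cover 4 topics. I want to test the Kmean method with the number of clusters k=4. Before using Kmean, I want to use SVD for feature reduction. When the results come out, nearly 90% of the documents are placed in 1 group, and the rest are spread over the other 3 groups. This is a very bad result. I have re-run it many times, but the results do not change much. If I use PCA instead of SVD, everything works as expected.

So, where am I going wrong? If anyone knows, could you give me some sample code? Many thanks.

Note: my original TF-IDF has rows representing documents and columns representing terms.

Here is my code: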

        //to matrix because the function SVD requiring input of matrix, not jagged array
        //transpose because the TF-IDF used for SVD has rows representing terms, columns representing documents; 
        var svd = new SingularValueDecomposition(tfidf.ToMatrix().Transpose());
        double[,] U = svd.LeftSingularVectors;
        double[,] S = svd.DiagonalMatrix;
        double[,] V = svd.RightSingularVectors;

        //find the optimal cutoff y so that we retain enough singular values to make up 90% of the energy in S
        //http://infolab.stanford.edu/~ullman/mmds/ch11.pdf, page 18-20
        double energy = 0;
        for (int i = 0; i < S.GetLength(0); i++)
        {
            energy += Math.Pow(S[i, i], 2);
        }

        double percent;
        int y = S.GetLength(0);
        do
        {
            y--;
            double test = 0;
            for (int i = 0; i < y; i++)
            {
                test += Math.Pow(S[i, i], 2);
            }

            percent = test / energy;
        } while (percent >= 0.9);
        y = y + 1;

        //Uk gets all rows, y first columns of U; Sk get y first rows, y first columns of S; Vk get y first rows, all columns of V
        double[,] Uk = U.Submatrix(0, U.GetLength(0) - 1, 0, y - 1);
        double[,] Sk = S.Submatrix(0, y - 1, 0, y - 1);
        double[,] Vk = V.Submatrix(0, y - 1, 0, V.GetLength(1) - 1);

        //reduce dimension according to http://stats.stackexchange.com/questions/107533/how-to-use-svd-for-dimensionality-reduction-to-reduce-the-number-of-columns-fea
        //we transpose again to have the rows being documents, columns being terms as in the original TF-IDF
        //ToArray because the Kmean below requires a jagged array as input
        tfidf = Uk.Multiply(Sk).Transpose().ToArray();
        // if tfidf = Uk.Multiply(Sk).Multiply(Vk).Transpose().ToArray()
        // result still bad

        // Create a K-Means algorithm using given k and a square Euclidean distance as distance metric.
        var kmeans = new KMeans(4, Distance.SquareEuclidean) { Tolerance = 0.05 };
        int[] labels = kmeans.Compute(tfidf);
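
The 90% energy cutoff computed by the do/while loop above can also be found in a single pass over the singular values. The following is a minimal sketch in plain C#, not part of Accord.NET; the class name `EnergyCutoff` and the method name `ComponentsForEnergy` are hypothetical, and the input is assumed to be the diagonal of S sorted in descending order.

```csharp
using System;

public static class EnergyCutoff
{
    // Returns the smallest k such that the first k singular values hold at
    // least `fraction` of the total energy (the sum of squared singular values).
    // Assumption: singularValues is sorted in descending order.
    public static int ComponentsForEnergy(double[] singularValues, double fraction)
    {
        if (singularValues == null || singularValues.Length == 0)
            throw new ArgumentException("At least one singular value is required.");

        // Total energy in S
        double total = 0;
        foreach (double s in singularValues)
            total += s * s;

        // Accumulate energy until the requested fraction is reached
        double running = 0;
        for (int i = 0; i < singularValues.Length; i++)
        {
            running += singularValues[i] * singularValues[i];
            if (running >= fraction * total)
                return i + 1;
        }
        return singularValues.Length;
    }
}
```

For singular values {4, 3, 2, 1} and a fraction of 0.9 this returns 3, because 16 + 9 + 4 = 29 out of a total of 30 is the first partial sum to reach 90%.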


Then we perform some steps to find out which documents belong to which groups, based on the labels.
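
For reference, these are the standard truncated-SVD identities for a matrix with terms as rows, which is the orientation the code above passes to SingularValueDecomposition. The symbols A, m, n and k are introduced here for illustration only (k plays the role of y in the code). It may be worth checking which of the two products ends up being fed to K-Means, since one has one row per term and the other has one row per document.

```latex
\begin{align}
A &= U S V^{\top}, \qquad A \in \mathbb{R}^{m \times n} \ (\text{terms} \times \text{documents}) \\
A_k &= U_k S_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{m \times k},\; S_k \in \mathbb{R}^{k \times k},\; V_k \in \mathbb{R}^{n \times k} \\
\text{term coordinates:} \quad U_k S_k &= A V_k \qquad (m \times k) \\
\text{document coordinates:} \quad V_k S_k &= A^{\top} U_k \qquad (n \times k)
\end{align}
```

Here U_k and V_k are the first k columns of U and V, and S_k is the leading k × k block of S.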

PCA in Accord.NET is already computed using SVD. An example of how to perform the SVD manually, without the help of the PCA class, follows.

The first step is to subtract the mean of your data (stored in a variable x).
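
The centering step can be sketched in plain C# as follows. This is an illustrative version that does not use Accord.NET; the class `Centering` and its methods are hypothetical names, and the data is assumed to be a jagged array with one row per sample and one column per variable.

```csharp
using System;

public static class Centering
{
    // Mean of each column of x (rows are samples, columns are variables).
    public static double[] ColumnMeans(double[][] x)
    {
        int cols = x[0].Length;
        var means = new double[cols];
        foreach (double[] row in x)
            for (int j = 0; j < cols; j++)
                means[j] += row[j];
        for (int j = 0; j < cols; j++)
            means[j] /= x.Length;
        return means;
    }

    // Returns a new matrix with the given column means subtracted from every row.
    public static double[][] Subtract(double[][] x, double[] means)
    {
        var z = new double[x.Length][];
        for (int i = 0; i < x.Length; i++)
        {
            z[i] = new double[means.Length];
            for (int j = 0; j < means.Length; j++)
                z[i][j] = x[i][j] - means[j];
        }
        return z;
    }
}
```

Keeping the means in a separate array makes it possible to apply the same shift to new data later on.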

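Now, you can optionally divide your data by their standard deviations, effectively converting your data to z-scores. This step is strictly optional, but it may make sense when your data represents variables collected in units of widely varying magnitudes (e.g. one column represents height in kilometers and another in centimeters).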
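
A minimal sketch of this optional scaling step, again in plain C# with hypothetical names (`Scaling`, `ColumnStdDevs`, `DivideInPlace`). It assumes the matrix has already been centered and uses the sample standard deviation (n - 1 in the denominator); leaving zero-variance columns untouched is a choice made here, not something the text prescribes.

```csharp
using System;

public static class Scaling
{
    // Sample standard deviation of each column of an already centered matrix z.
    public static double[] ColumnStdDevs(double[][] z)
    {
        int cols = z[0].Length;
        var sd = new double[cols];
        foreach (double[] row in z)
            for (int j = 0; j < cols; j++)
                sd[j] += row[j] * row[j];
        for (int j = 0; j < cols; j++)
            sd[j] = Math.Sqrt(sd[j] / (z.Length - 1));
        return sd;
    }

    // Divides each column of z by its standard deviation, in place.
    // Columns with zero standard deviation are left unchanged to avoid dividing by zero.
    public static void DivideInPlace(double[][] z, double[] sd)
    {
        foreach (double[] row in z)
            for (int j = 0; j < sd.Length; j++)
                if (sd[j] != 0)
                    row[j] /= sd[j];
    }
}
```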

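Now, the principal components of x are the eigenvectors of Cov(x). Therefore, if we compute the SVD of z (x standardized), the columns of the matrix V (the right-hand side of the SVD) will be the principal components of x. With that, what we need to do now is compute the singular value decomposition (SVD) of the matrix z: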

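// requires: using Accord.Math; using Accord.Math.Decompositions;
// z is the centered (and optionally standardized) data, one row per sample
var svd = new JaggedSingularValueDecomposition(z,
    computeLeftSingularVectors: false,
    computeRightSingularVectors: true,
    autoTranspose: true);

var singularValues = svd.Diagonal;

// Eigenvalues of the covariance matrix: squared singular values divided by (n - 1)
var eigenvalues = singularValues.Pow(2).Divide(x.Rows() - 1);

// One component vector per row
var componentVectors = svd.RightSingularVectors.Transpose();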
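
The squaring and the division by the number of rows minus one in the snippet above follow from a short derivation. It is written here under the assumption that z is x with the column means subtracted (no scaling) and that n is the number of rows of x; if z was also divided by the standard deviations, the same steps give the correlation matrix instead of the covariance matrix.

```latex
\begin{align}
z &= U S V^{\top} \\
\operatorname{Cov}(x) &= \frac{z^{\top} z}{n-1}
  = \frac{V S U^{\top} U S V^{\top}}{n-1}
  = V \, \frac{S^{2}}{n-1} \, V^{\top} \\
\lambda_i &= \frac{s_i^{2}}{n-1}
\end{align}
```

So the columns of V are the eigenvectors of Cov(x), and each eigenvalue is the corresponding squared singular value divided by n - 1.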

If you would like to perform whitening, you can also divide the vectors by the singular values:

componentVectors = componentVectors.Divide(singularValues, dimension: 1);

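Now, if you would like to project your data so as to retain 90% of the variance, compute the cumulative sum of the eigenvalues as shown below: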

// Calculate proportions
var componentProportions = eigenvalues.Abs().Divide(eigenvalues.Abs().Sum());

// Calculate cumulative proportions
var componentCumulative = componentProportions.CumulativeSum();

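Now, determine the number of dimensions you need by looking at where the cumulative proportion becomes greater than the proportion of variance you want. Once you know that number, select only those eigenvectors from the eigenvector matrix: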

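int numberOfVectors = // number of vectors that you need

// Project each row of z onto the selected component vectors
int rows = z.Length;
double[][] result = new double[rows][];
for (int i = 0; i < rows; i++)
{
    result[i] = new double[numberOfVectors];
    for (int j = 0; j < numberOfVectors; j++)
        for (int k = 0; k < componentVectors[j].Length; k++)
            result[i][j] += z[i][k] * componentVectors[j][k];
}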
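
One possible way to pick the number of vectors from the cumulative proportions is sketched below in plain C#. The class `ComponentSelection` and the method `CountForVariance` are hypothetical names introduced for illustration; the input is assumed to be a non-decreasing array such as the componentCumulative computed earlier.

```csharp
using System;

public static class ComponentSelection
{
    // Returns how many leading components are needed for the cumulative
    // proportion of variance to reach `threshold` (for example 0.9).
    // Assumption: `cumulative` is non-decreasing and ends at (about) 1.0.
    public static int CountForVariance(double[] cumulative, double threshold)
    {
        for (int i = 0; i < cumulative.Length; i++)
        {
            if (cumulative[i] >= threshold)
                return i + 1;
        }
        // Threshold never reached (e.g. due to rounding): keep every component
        return cumulative.Length;
    }
}
```

The returned count could then be used as numberOfVectors in the projection loop. Because the cumulative sums are floating-point values, a threshold that sits exactly on a partial sum may fall on either side of the comparison.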


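In the example above, I am transforming the matrix z, which has already been mean-centered and, optionally, standardized. Before transforming another set of data, do not forget to apply the same transformations that were applied to the original matrix.
Finally, keep in mind that doing all of the above by hand is completely optional. You should really feel free to use the PrincipalComponentAnalysis class to do all of this heavy lifting for you.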