How to perform PCA feature selection natively in Java, given the eigenvectors and explained variance scores
I'm trying to do feature selection with PCA in Java without using an ML framework, only the Apache Commons Math matrix library. The input test data is a 2D array, 4 feature columns x 100 instance rows. I roughly went through the following steps:

Load the data, normalize it, store it in a RealMatrix and compute the covariance matrix:
PCAResultSet pcaResultSet = new PCAResultSet();
double[][] data = dataToDoubleArray();
if (this.normalize)
    data = StatMath.normalize(data);
RealMatrix origData = MatrixUtils.createRealMatrix(data);
Covariance covariance = new Covariance(origData);
/* The eigenvectors of the covariance matrix represent the
* principal components (the directions of maximum variance)
*/
RealMatrix covarianceMatrix = covariance.getCovarianceMatrix();
Perform the eigendecomposition to get the eigenvectors and eigenvalues:
/* Each of those eigenvectors is associated with an eigenvalue which can be
* interpreted as the “length” or “magnitude” of the corresponding eigenvector.
* If some eigenvalues have a significantly larger magnitude than others,
* then the reduction of the dataset via PCA onto a smaller dimensional subspace
* by dropping the “less informative” eigenpairs is reasonable.
*
* Eigenvectors represent the relative basis (axis) for the data
*
* Computes new variables from the PCA analysis
*/
EigenDecomposition decomp = new EigenDecomposition(covarianceMatrix);
/* The numbers on the diagonal of the diagonalized covariance matrix
* are called eigenvalues of the covariance matrix. Large eigenvalues
* correspond to large variances.
*/
double[] eigenvalues = decomp.getRealEigenvalues();
/* The directions of the new rotated axes are called the
 * eigenvectors of the covariance matrix.
 *
 * The columns of V are the eigenvectors, sorted by
 * descending eigenvalue
 */
RealMatrix eigenvectors = decomp.getV();
pcaResultSet.setEigenvectors(eigenvectors);
pcaResultSet.setEigenvalues(eigenvalues);
Select the first n eigenvectors (already sorted by descending eigenvalue), then project the data by multiplying the transposed n x m eigenvector matrix with the transposed original data:
/* Keep the first n columns, corresponding to the
 * principal components with the largest eigenvalues
 */
int rows = eigenvectors.getRowDimension();
int cols = 1; // keep only the first principal component here
RealMatrix evecTran = eigenvectors.getSubMatrix(0, rows - 1, 0, cols - 1).transpose();
RealMatrix origTran = origData.transpose();
/* The projected data onto the lower-dimension hyperplane */
RealMatrix dataProj = evecTran.multiply(origTran).transpose();
Finally, compute the explained variance of each principal component:
/* The variance explained ratio of an eigenvalue λ_j is
* simply the fraction of an eigenvalue λ_j and the total
* sum of the eigenvalues
*/
double[] explainedVariance = new double[eigenvalues.length];
double sum = StatMath.sum(eigenvalues);
for (int i = 0; i < eigenvalues.length; i++)
explainedVariance[i] = ((eigenvalues[i] / sum) * 100);
pcaResultSet.setExplainedVariance(explainedVariance);
pcaResultSet.print();
Utils.print("PCA", "Projected Data:", 0, true);
printMatrix(dataProj);
return pcaResultSet;
With this code, PC1 explains roughly 90% of the variance. But how do I use that result to perform feature selection, i.e. to decide which features to remove from the original data?
Frameworks like Weka will rank the features to show which combination from the original set produces the best result, and I'm trying to do the same, but I'm not sure how the eigenvector/decomposition scores map back to the original features.

From your question I understand that you want to use PCA for feature selection or elimination. One approach is the reconstruction error. To compute it, perform an inverse PCA to recover the original values of the 2D array from the principal components; call the result reconstructedData, with your original array being originalData. Now form the error matrix, say errorMat, which is simply reconstructedData - originalData. In this errorMat, compute the MAE of each column. You can then either select the top n columns with the lowest MAE, or reject the top m columns with the highest MAE. Sorry, I don't know Java, so I can't post code, but I can help you conceptually, so let me know if you run into any difficulty implementing the logic above.
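A minimal Java sketch of the reconstruction-error idea described above, using plain 2D arrays instead of Commons Math's RealMatrix so it stands alone (the class and method names PcaReconstructionError and featureMae are made up for illustration). It assumes the data is already normalized and that eigenvectors holds the eigenvectors as columns, sorted by descending eigenvalue, which is the layout EigenDecomposition.getV() returns:

```java
import java.util.Arrays;

public class PcaReconstructionError {

    /** C = A * B for plain 2D arrays. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < m; j++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    /** B = A^T */
    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    /**
     * Per-feature mean absolute reconstruction error after projecting
     * the (normalized) data onto the top-k principal components and back.
     *
     * data:         n x m data matrix
     * eigenvectors: m x m matrix, eigenvectors as columns,
     *               sorted by descending eigenvalue
     * k:            number of principal components to keep
     */
    static double[] featureMae(double[][] data, double[][] eigenvectors, int k) {
        int m = eigenvectors.length;
        double[][] wk = new double[m][k];     // W_k: first k eigenvector columns
        for (int i = 0; i < m; i++)
            for (int j = 0; j < k; j++)
                wk[i][j] = eigenvectors[i][j];
        // inverse PCA: reconstructedData = X * W_k * W_k^T
        double[][] reconstructed = multiply(multiply(data, wk), transpose(wk));
        // column-wise MAE of errorMat = reconstructedData - originalData
        int n = data.length;
        double[] mae = new double[m];
        for (int j = 0; j < m; j++) {
            double sum = 0;
            for (int i = 0; i < n; i++)
                sum += Math.abs(reconstructed[i][j] - data[i][j]);
            mae[j] = sum / n;
        }
        return mae;
    }

    public static void main(String[] args) {
        // Toy check: with identity eigenvectors and k = 1, the first
        // feature is reconstructed exactly and the second is zeroed out.
        double[][] data = {{1, 2}, {3, 4}};
        double[][] eye  = {{1, 0}, {0, 1}};
        System.out.println(Arrays.toString(featureMae(data, eye, 1)));
    }
}
```

Features with a low MAE are well captured by the kept components; features with a high MAE carry information the kept components miss, so depending on your goal you keep one group and drop the other. With RealMatrix the same computation is origData.multiply(wk).multiply(wk.transpose()).subtract(origData), followed by a column-wise mean of absolute values.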