How to perform PCA feature selection (natively) in Java, given the eigenvectors and explained variance scores


I'm trying to do feature selection with PCA in Java without using an ML framework, only the Apache Commons Math matrix library.

The input test data is a 2D array of 4 feature columns x 100 instance rows. I roughly followed these steps:

Load the data, normalize it, store it in a RealMatrix, and compute the covariance matrix:

PCAResultSet pcaResultSet = new PCAResultSet();
double[][] data = dataToDoubleArray();

if (this.normalize)
    data = StatMath.normalize(data);

RealMatrix origData = MatrixUtils.createRealMatrix(data);
Covariance covariance = new Covariance(origData);

/* The eigenvectors of the covariance matrix represent the 
 * principal components (the directions of maximum variance) 
 */
RealMatrix covarianceMatrix = covariance.getCovarianceMatrix();
Perform the eigendecomposition to get the eigenvectors and eigenvalues:

/* Each of those eigenvectors is associated with an eigenvalue which can be 
 * interpreted as the “length” or “magnitude” of the corresponding eigenvector. 
 * If some eigenvalues have a significantly larger magnitude than others, 
 * then the reduction of the dataset via PCA onto a smaller dimensional subspace 
 * by dropping the “less informative” eigenpairs is reasonable.
 * 
 *  Eigenvectors represent the relative basis (axis) for the data
 *  
 *  Computes new variables from the PCA analysis
 */
EigenDecomposition decomp = new EigenDecomposition(covarianceMatrix);

/* The numbers on the diagonal of the diagonalized covariance matrix 
 * are called eigenvalues of the covariance matrix. Large eigenvalues 
 * correspond to large variances. 
 */
double[] eigenvalues = decomp.getRealEigenvalues();

/* The directions of the new rotated axes are called the
 * eigenvectors of the covariance matrix.
 *
 * In Commons Math, the columns of V are the eigenvectors
 */
RealMatrix eigenvectors = decomp.getV(); 

pcaResultSet.setEigenvectors(eigenvectors);
pcaResultSet.setEigenvalues(eigenvalues);
Select the top n eigenvectors (already sorted in descending order by default), then project the data by multiplying the transposed n x m eigenvector matrix by the transposed original data:

/* Keep the first n columns of V, corresponding to the
 * largest eigenvalues (here n = 1, i.e. only PC1)
 */
int rows = eigenvectors.getRowDimension();
int cols = 1;

RealMatrix evecTran = eigenvectors.getSubMatrix(0, rows - 1, 0, cols - 1).transpose();
RealMatrix origTran = origData.transpose();

/* The projected data onto the lower-dimension hyperplane */
RealMatrix dataProj = evecTran.multiply(origTran).transpose();
Finally, compute the explained variance of each principal component:

/* The explained variance ratio of an eigenvalue λ_j is
 * simply λ_j divided by the total sum of the eigenvalues
 */
double[] explainedVariance = new double[eigenvalues.length];
double sum = StatMath.sum(eigenvalues);

for (int i = 0; i < eigenvalues.length; i++)
    explainedVariance[i] = ((eigenvalues[i] / sum) * 100);

pcaResultSet.setExplainedVariance(explainedVariance);
pcaResultSet.print();

Utils.print("PCA", "Projected Data:", 0, true);
printMatrix(dataProj);

return pcaResultSet;
With this code PC1 explains about 90% of the variance, but how can I use this result to perform feature selection, i.e. to decide which features to remove from the original data?


Frameworks like Weka can rank features to show which combination from the original set produces the best results, which is what I'm trying to do as well, but I'm not sure how the eigenvectors/decomposition scores map back to the original features.

From your question I understand that you want to use PCA for feature selection or elimination.

One way to do this is with the reconstruction error.

To compute the reconstruction error you perform an inverse PCA, so that you recover the original values of the 2D array from the principal components. Let's call the result reconstructedData, and your original array originalData.

Now compute the error matrix, let's call it errorMat, which is nothing more than reconstructedData - originalData.

Now, in this errorMat, compute the MAE (mean absolute error) of each column. You can then either select the top n columns with the lowest MAE, or reject the top m columns with the highest MAE.

Sorry, I don't know Java, so I can't post code. But I can help you conceptually, so let me know if you run into any difficulty implementing the logic above.
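
To make this concrete, here is a minimal Java sketch of the reconstruction-error logic above, using Apache Commons Math and reusing the variable names from the question (dataProj, evecTran, origData); the helper names reconstruct and columnMae are illustrative, not part of any library:

import org.apache.commons.math3.linear.RealMatrix;

public class ReconstructionError {

    /* Inverse PCA: because the eigenvectors are orthonormal, the
     * inverse of the projection is its transpose, so mapping the
     * projected data back onto the original axes is a single
     * matrix multiplication.
     *
     * dataProj is n x k (instances x kept components)
     * evecTran is k x m (kept eigenvectors as rows)
     * returns  n x m (reconstruction in the original feature space)
     */
    static RealMatrix reconstruct(RealMatrix dataProj, RealMatrix evecTran) {
        return dataProj.multiply(evecTran);
    }

    /* Mean absolute error of each column of (reconstructed - original).
     * Columns that survive the round trip with a low MAE are well
     * captured by the kept components; columns with a high MAE are
     * candidates for removal.
     */
    static double[] columnMae(RealMatrix reconstructed, RealMatrix original) {
        RealMatrix errorMat = reconstructed.subtract(original);
        int rows = errorMat.getRowDimension();
        double[] mae = new double[errorMat.getColumnDimension()];
        for (int j = 0; j < mae.length; j++) {
            double sum = 0;
            for (int i = 0; i < rows; i++)
                sum += Math.abs(errorMat.getEntry(i, j));
            mae[j] = sum / rows;
        }
        return mae;
    }
}

With the question's variables this becomes

double[] mae = ReconstructionError.columnMae(
        ReconstructionError.reconstruct(dataProj, evecTran), origData);

and the original feature columns with the highest MAE are the candidates to drop. Note that origData in the question is already the normalized matrix the projection was computed from, so the comparison is consistent.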