Python PCA Implementation

I'm working on an assignment for an online course in which I have to implement PCA in Python. Unfortunately, when I run the comparison the course provides between my implementation and scikit-learn's, my results differ by far too much.

After hours of review, I'm still not sure where it's going wrong. If someone could take a look and identify which step I coded or interpreted incorrectly, I would be very grateful.

def normalize(X):
    """
    Normalize the given dataset X to have zero mean.

    Args:
        X: ndarray, dataset of shape (N,D)
    Returns:
        (Xbar, mean): tuple of ndarray, Xbar is the normalized dataset
        with mean 0; mean is the sample mean of the dataset.

    Note: 
        You will encounter dimensions where the standard deviation is zero.

        For those ones, the process of normalization results in normalized data with NaN entries.  

        We can handle this by setting the std = 1 for those dimensions when doing normalization.  
    """
    # YOUR CODE HERE
    ### Uncomment and modify the code below
    mu = np.mean(X, axis = 0) # Setting axis = 0 will compute means column-wise.  Setting it to 1 will compute the mean across rows.  
    std = np.std(X, axis = 0) # Computing the std dev column wise using axis = 0.  
    std_filled = std.copy() 
    std_filled[std == 0] = 1
    # Compute the normalized data as Xbar 
    Xbar = (X - mu)/std_filled
    return Xbar, mu, # std_filled

def eig(S):
    """
    Compute the eigenvalues and corresponding unit eigenvectors for the covariance matrix S.

    Args:
        S: ndarray, covariance matrix

    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors

    Note:
        the eigenvals and eigenvecs should be sorted in descending
        order of the eigen values
    """
    # YOUR CODE HERE
    # Uncomment and modify the code below
    # Compute the eigenvalues and eigenvectors
    # You can use library routines in `np.linalg.*` https://numpy.org/doc/stable/reference/routines.linalg.html for this
    eigvals, eigvecs = np.linalg.eig(S)
    # The eigenvalues and eigenvectors need to be sorted in descending order according to the eigenvalues
    # We will use `np.argsort` (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) to find a permutation of the indices
    # of eigvals that will sort eigvals in ascending order and then find the descending order via [::-1], which reverses the indices
    sort_indices = np.argsort(eigvals)[::-1]
    # Notice that we are sorting the columns (not rows) of eigvecs since the columns represent the eigenvectors.
    return eigvals[sort_indices], eigvecs[:, sort_indices]


def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by the columns of `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace

    Returns:
        P: the projection matrix
    """
    # YOUR CODE HERE
    P = B @ (np.linalg.inv(B.T @ B)) @ B.T
    return P

def select_components(eig_vals, eig_vecs, num_components):
    """ 
    Selects the n components desired for projecting the data upon.  

    Args:
        eig_vals: The eigenvalues sorted in descending order of magnitude. 
        eig_vecs:  The eigenvectors sorted in order relative to that of the eigenvalues.
        num_components: the number of principal components to use.  
    Returns: 
        The number of desired components to keep for projection of the data upon. 
    """
    principal_vals, principal_components = eig_vals[:num_components], eig_vecs[:, range(num_components)]

    return principal_vals, principal_components


def PCA(X, num_components):
    """
    Projects normalized data onto the 'n' desired principal components.

    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
        and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        the reconstructed data, the sample mean of the X, principal values
        and principal components
    """
    # Normalize to have mean 0 and variance 1.
    Z, mean_vec = normalize(X) 
    # Calculate the covariance matrix 
    S = np.cov(Z, rowvar=False, bias=True) # Set rowvar = False to treat columns as variables.  Set bias = True to ensure normalization is done with N and not N-1
    # Calculate the (unit) eigenvectors and eigenvalues of S.  Sort them in descending order of importance relative to the magnitude of the eigenvalues.  
    eig_vals, eig_vecs = eig(S)
    # Keep only the n largest Principle Components of the sorted unit eigenvectors.
    principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
    # Compute the projection matrix using only the n largest Principle Components of the sorted unit eigenvectors, where n = num_components.  
    #P = projection_matrix(eig_vecs[:, :num_components])
    P = projection_matrix(principal_components)
    # Reconstruct the data by using the projection matrix to project the data onto the principal component vectors we've kept
    X_reconst = (P @ X.T).T 

    return X_reconst, mean_vec, principal_vals, principal_components
Here is the test case I'm supposed to pass:

random = np.random.RandomState(0)
X = random.randn(10, 5)

from sklearn.decomposition import PCA as SKPCA

for num_component in range(1, 4):
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver="full")
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(X))
    reconst, _, _, _ = PCA(X, num_component)
    # The difference in the result should be very small (<10^-20)
    print(
        "difference in reconstruction for num_components = {}: {}".format(
            num_component, np.square(reconst - sklearn_reconst).sum()
        )
    )
    np.testing.assert_allclose(reconst, sklearn_reconst)

As far as I can tell, there are a few things wrong with your code.

Your projection matrix is wrong.

If the eigenvectors of the covariance matrix are B, with dimensions D x M, where M is the number of components you selected and D is the dimensionality of the original data, then the projection matrix is just B @ B.T.
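
For example, assuming the columns of B are orthonormal unit eigenvectors (which they are for a symmetric covariance matrix), B.T @ B is the identity, so the general formula in your projection_matrix collapses; a minimal sketch:

def projection_matrix(B):
    # B has orthonormal columns, so B.T @ B is the M x M identity and
    # B @ inv(B.T @ B) @ B.T reduces to B @ B.T.
    return B @ B.T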

In a standard implementation of PCA, we typically do not scale the data by the inverse of the standard deviation. You seem to be attempting something like whitened PCA (ZCA), but even then it looks wrong.

As a quick test, you can compute the normalized data without dividing by the standard deviation, and set bias=False when computing the covariance matrix.

You should also subtract the mean from the data before multiplying it by the projection operator, and then add it back afterwards, i.e. X_reconst = (P @ (X - mean_vec).T).T + mean_vec.
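
Putting these fixes together, a rough sketch of what the corrected PCA could look like (this is only an illustration: it folds your helpers into one function and uses np.linalg.eigh, which is my own choice, not something from your course):

import numpy as np

def PCA(X, num_components):
    # Center the data only; no division by the standard deviation.
    mean_vec = np.mean(X, axis=0)
    Z = X - mean_vec
    # Covariance of the centered data (bias=False divides by N-1).
    S = np.cov(Z, rowvar=False, bias=False)
    # Eigen-decomposition of the symmetric matrix S, sorted by descending eigenvalue.
    eig_vals, eig_vecs = np.linalg.eigh(S)
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    principal_vals = eig_vals[:num_components]
    principal_components = eig_vecs[:, :num_components]
    # Projection matrix: B @ B.T, since the eigenvectors are orthonormal.
    P = principal_components @ principal_components.T
    # Project the *centered* data, then add the mean back.
    X_reconst = (P @ Z.T).T + mean_vec
    return X_reconst, mean_vec, principal_vals, principal_components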

PCA is essentially just a change of basis, followed by discarding the coordinates that correspond to directions of low variance. The eigenvectors of the covariance matrix form the new orthogonal basis, and the eigenvalues tell you the variance of the data along the direction of the corresponding eigenvector. P = B @ B.T is simply the change to the new basis, with some coordinates dropped (B.T), followed by the change back to the original basis (B).
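
A tiny made-up numeric illustration of that view: for an orthonormal basis B of a subspace, B.T @ x gives the coordinates of x in the new basis, and B maps those coordinates back, so (B @ B.T) @ x is the projection of x onto the subspace:

import numpy as np

# Orthonormal basis B for a 2-D subspace of R^3 (example values only).
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
x = np.array([2.0, 3.0, 4.0])

coords = B.T @ x     # coordinates of x in the new basis (one coordinate dropped)
x_proj = B @ coords  # change back to the original basis
print(x_proj)        # [2. 3. 0.], identical to (B @ B.T) @ x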

Edit:
I would be curious to know which online course teaches people to implement PCA this way.
