Python: PCA plots for multiple CSV files


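I have 60 .csv files, all with the same column names. Each file contains 99 numeric columns and one target class (benign and malignant). I need help with the following:

(1) Normalize and scale each .csv file

(2) Draw one PCA plot per .csv file (60 plots in total)

(3) Use each file name as the title of its plot (for example, cancer_breast.csv and cancer_skin.csv would give plots 1 and 2 the titles "cancer_breast" and "cancer_skin"), and use the target class (benign and malignant) as the legend

(4) Annotate each plot with its explained variance ratio values

(5) Save the PCA plots to PDF, 4 per page (15 PDF files in total)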
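Requirements (3) and (5) come down to two small pieces of bookkeeping: deriving a title from each file name and splitting the 60 files into groups of four. A minimal sketch, assuming hypothetical file names and helper names (`chunk` and `title_from_path` are not from the question), might look like this:

```python
from pathlib import Path

def chunk(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def title_from_path(path):
    """Use the file name without its extension as the plot title."""
    return Path(path).stem

# Hypothetical file names standing in for the 60 real .csv files
paths = [f"cancer_{i:02d}.csv" for i in range(60)]
pages = chunk(paths, 4)

print(len(pages))                            # 15
print(title_from_path("cancer_breast.csv"))  # cancer_breast
```

With 60 files and 4 plots per page, this gives the 15 groups mentioned in (5).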

An example for a single .csv file is shown below.

# PCA for a single csv file looks like this

# Importing one file would look like this
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
breast_data = breast.data
breast_labels = breast.target  # this definition was missing from the snippet

labels = np.reshape(breast_labels, (569, 1))
final_breast_data = np.concatenate([breast_data, labels], axis=1)
breast_dataset = pd.DataFrame(final_breast_data)
features = breast.feature_names
features_labels = np.append(features, 'label')
breast_dataset.columns = features_labels
# scikit-learn encodes this dataset as 0 = malignant, 1 = benign
breast_dataset['label'] = np.where(breast_dataset['label'] == 0, 'Malignant', 'Benign')

# STEP 1: normalization and transformation 
from sklearn.preprocessing import StandardScaler
x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x) # normalizing the features

feat_cols = ['feature'+str(i) for i in range(x.shape[1])]
normalised_breast = pd.DataFrame(x,columns=feat_cols)

# Step 2
from sklearn.decomposition import PCA
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)

#Step 3a
principal_breast_Df = pd.DataFrame(data = principalComponents_breast
             , columns = ['PC1', 'PC2'])

# step 3b: Explained_variance_ratio
print('Explained variation per principal component: {}'.format(pca_breast.explained_variance_ratio_))

# Step 4
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))  # a single figure; the extra empty plt.figure() call is dropped
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1', fontsize=20)
plt.ylabel('Principal Component - 2', fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset", fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets, colors):
    # Boolean mask selecting the rows that belong to this class
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'PC1'],
                principal_breast_Df.loc[indicesToKeep, 'PC2'], c=color, s=50)

plt.legend(targets, prop={'size': 15})

https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python
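To reuse the single-dataset steps above for every file, they could be wrapped in one function. The following is only a sketch: the function name `pca_2d` and the target column name `"label"` are assumptions, since the question does not say what the target column is called in the real files, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_2d(df, target_col="label"):
    """Standardize the feature columns of df and project them onto two PCs.

    target_col is an assumed column name; adjust it to the real files.
    Returns (DataFrame with PC1, PC2 and the target, explained variance ratios).
    """
    features = df.drop(columns=[target_col])
    scaled = StandardScaler().fit_transform(features.values)
    pca = PCA(n_components=2)
    components = pca.fit_transform(scaled)
    result = pd.DataFrame(components, columns=["PC1", "PC2"], index=df.index)
    result[target_col] = df[target_col]
    return result, pca.explained_variance_ratio_

# Small synthetic table standing in for one .csv file
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(20, 5)), columns=[f"f{i}" for i in range(5)])
demo["label"] = ["Benign", "Malignant"] * 10

pcs, ratio = pca_2d(demo)
print(pcs.head())
print(ratio)
```

Returning the explained variance ratios alongside the components keeps everything needed for the plot annotation in (4) in one place.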

My attempt is shown below.


import glob
import matplotlib.pyplot as plt
import pandas as pd

fig, axs = plt.subplots(nrows=2, ncols=2)
for ax, file in zip(axs.flatten(), glob.glob("./*.csv")):
    df_holder = pd.read_csv(file)

    # I want to include the normalization, transformation, PCA and
    # explained_variance_ratio_ steps from the earlier example here

    ax.scatter(df_holder["PC1"], df_holder["PC2"])  # plotting the resulting PC1 vs PC2
    ax.set_title(" ")  # I want to assign each .csv file name as the title
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
fig.tight_layout()
fig.savefig("pca_scatter.pdf")

Solutions in either Python or R are welcome.
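One possible way to combine all five requirements is sketched below, under several assumptions: the target column is named `"label"`, its values are exactly `"Benign"` and `"Malignant"`, and a single multi-page PDF (one page per group of four plots) is acceptable instead of separate files. The function name `save_pca_pages` is hypothetical. It uses matplotlib's `PdfPages` to write one page per 2x2 grid.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only files are written
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

PER_PAGE = 4  # matches the 2x2 grid below

def save_pca_pages(csv_paths, out_pdf, target_col="label"):
    """Write one PCA scatter per CSV file, four panels per PDF page.

    target_col and the class names in `colors` are assumptions.
    Returns the number of pages written.
    """
    colors = {"Benign": "r", "Malignant": "g"}
    with PdfPages(out_pdf) as pdf:
        for start in range(0, len(csv_paths), PER_PAGE):
            batch = csv_paths[start:start + PER_PAGE]
            fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
            axes = axs.flatten()
            for ax, path in zip(axes, batch):
                df = pd.read_csv(path)
                # Step 1: standardize the numeric feature columns
                x = StandardScaler().fit_transform(df.drop(columns=[target_col]).values)
                # Step 2: project onto the first two principal components
                pca = PCA(n_components=2)
                pcs = pca.fit_transform(x)
                r1, r2 = pca.explained_variance_ratio_
                # Step 3: one scatter per class so the legend shows the target
                for target, color in colors.items():
                    mask = (df[target_col] == target).values
                    ax.scatter(pcs[mask, 0], pcs[mask, 1], c=color, s=20, label=target)
                ax.set_title(Path(path).stem)  # file name without ".csv"
                # Step 4: explained variance ratio shown in the axis labels
                ax.set_xlabel(f"PC1 ({r1:.1%} explained variance)")
                ax.set_ylabel(f"PC2 ({r2:.1%} explained variance)")
                ax.legend()
            for ax in axes[len(batch):]:
                ax.axis("off")  # hide unused panels on a partial last page
            fig.tight_layout()
            pdf.savefig(fig)  # Step 5: one page per group of four
            plt.close(fig)
        return pdf.get_pagecount()
```

A call such as `save_pca_pages(sorted(glob.glob("./*.csv")), "pca_scatter.pdf")` (after `import glob`) would then produce 15 pages for 60 files. If separate PDF files are really required, calling `fig.savefig(...)` with a different file name for each group, instead of using `PdfPages`, should work as well.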