Python: Question about applying PCA with one component

python matplotlib machine-learning

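I have a dataset, and I have been assigned to apply PCA keeping one component, then visualize the distribution in a scatter plot that indicates the class of each data point.

For context: the data we are working with has three columns. X is columns 1 and 2, and y is column 3, which contains the class of each data point.

This means the final visualization should be a horizontal line, but that is not what I see. The resulting visualization is a scatter plot that looks like a positive linear distribution.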

import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


df = pd.read_csv("data.csv", header=None)
X = df.iloc[:, 0:2].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=np.random)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

pcaObj1 = PCA(n_components=1)
X_train_PCA = pcaObj1.fit_transform(X_train)
X_test_PCA = pcaObj1.transform(X_test)
X_set, y_set = X_test_PCA, y_test
X3 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01))
X3 = np.array(X3)

plt.xlim(X3.min(), X3.max())
plt.ylim(X3.min(), X3.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 0],
                c = ListedColormap(('purple', 'yellow'))(i), label = j)

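I see you have a train/test split, but that is not the usual setup for PCA. PCA has several applications, but one of the main ones is dimensionality reduction. Dimensionality reduction is about removing variables, and PCA serves that purpose by transforming your data into components and ordering them by the total (or relative) amount of variance each one explains linearly. Since this does not require test data, we can treat it as an unsupervised technique, although many people prefer to call it a preprocessing step, because it is commonly used to preprocess data in order to improve the performance of models trained on the preprocessed data.

For the sake of an example, let me generate a random dataset with 10 variables and 1000 entries. Fitting a PCA transform with 1 component selects one new variable (feature) that is a linear combination of the original variables and tries to linearly explain as much of the variance in the data as possible. As you said, this is a number line; as a quick and simple plot, let's use the index into the new variable's array as the x-axis and the value of the variable as the y-axis.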

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)

pcaObj1 = PCA(n_components=1)
X_PCA = pcaObj1.fit_transform(X_train)

plt.scatter(range(len(y_labels)), X_PCA, c=['red' if i==0 else 'green' for i in y_labels])
plt.show()

>>> X_PCA.shape
(1000, 1)

As you can see, this produces a 1000 x 1 array representing your new variable.
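To make the "linear combination" point concrete, here is a minimal sketch (not part of the original answer) that reproduces the single component by hand: center the data, then project it onto the weight vector that scikit-learn stores in components_. The seeded random generator and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumption: seeded random data standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.random((1000, 10))

pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

# components_ holds one weight per original variable, shape (1, 10)
weights = pca.components_

# The new variable is the centered data projected onto those weights
X_manual = (X - pca.mean_) @ weights.T

print(X_pca.shape)
print(np.allclose(X_pca, X_manual))
```

If the two arrays agree, the single PCA feature is just a weighted sum of the centered original columns.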

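If you had chosen n_components=2, you would get a 1000 x 2 array containing two such variables. Let's look at that example. This time I will plot the two principal components against each other, rather than plotting a single principal component against the index.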

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)

pcaObj1 = PCA(n_components=2)
X_PCA = pcaObj1.fit_transform(X_train)

plt.scatter(X_PCA[:,0], X_PCA[:,1], c=['red' if i==0 else 'green' for i in y_labels])
plt.show()

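Now, my randomly generated data may not have the same properties as your dataset. If you really expect the output to be a line, I would say it definitely is not, since my example produces a very complicated trace. Even in the 2D case you can see that the data does not appear to be structured by class, but that is exactly what you would expect from random data.

This example should make things clearer. Make sure you read all the comments so that you understand what is going on.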
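As for the plot in the question: its scatter call passes the same column, X_set[y_set == j, 0], as both the x and the y coordinate, so every point lands on the diagonal, which would explain the "positive linear" look. Below is a minimal sketch of one way to get a horizontal line instead, by plotting the single component against a constant y value. The synthetic two-class data and the color choices are assumptions, since data.csv is not available here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: two synthetic 2-feature classes standing in for data.csv
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(4.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

X_scaled = StandardScaler().fit_transform(X)
X_1d = PCA(n_components=1).fit_transform(X_scaled)

fig, ax = plt.subplots()
for label, color in zip(np.unique(y), ("purple", "gold")):
    mask = y == label
    # x: the single principal component, y: a constant, so points form a line
    ax.scatter(X_1d[mask, 0], np.zeros(mask.sum()), c=color, label=str(label))
ax.set_yticks([])
ax.set_xlabel("first principal component")
ax.legend()
# call plt.show() to display the figure interactively
```

Using a constant for y keeps both classes on the same line, so any separation between them shows up purely along the component axis.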

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")


from sklearn import datasets
iris = datasets.load_iris()

# iris.data and iris.target are plain numpy arrays, so wrap them in pandas
# objects; .sample(), .columns and value_counts() below need that.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = pd.DataFrame(iris.data[:, 0:4], columns=columns)  # we take the first four features.
y = pd.Series(iris.target, name="species")

# combined frame with species names, used for counting and plotting
data = X.copy()
data["species"] = iris.target_names[iris.target]

print(X.sample(5))
print(y.sample(5))

# see how many samples we have of each species
print(data["species"].value_counts())


from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns = X.columns)

print(X_scaled.sample(5))


# try clustering on the 4d data and see if can reproduce the actual clusters.

# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.

# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.

from sklearn.cluster import KMeans

nclusters = 3 # this is the k in kmeans
seed = 0

km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)

# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
print(y_cluster_kmeans)


# use seaborn to make scatter plot showing species for each sample
# (the FacetGrid argument is called height; older seaborn releases used size)
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend()
plt.show()
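The comments above mention checking whether the predicted clusters match the labels in y. Cluster ids from k-means are arbitrary, so comparing them to the species labels element by element can be misleading. A minimal sketch of one way to do the comparison, using a contingency table and the adjusted Rand index (both are my additions, not part of the original answer):

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

# n_init is set explicitly because its default differs between sklearn versions
km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(X_scaled)

# Rows: true species, columns: k-means cluster ids (the ids are arbitrary)
table = pd.crosstab(iris.target_names[iris.target], clusters,
                    rownames=["species"], colnames=["cluster"])
print(table)

# 1.0 means identical groupings; values near 0.0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, clusters)
print(f"adjusted Rand index: {ari:.2f}")
```

The table shows which cluster each species mostly falls into, while the index summarizes the agreement in a single number that does not depend on how the clusters happen to be numbered.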

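You say you have a question, but I think you forgot to ask it. I assume you preprocessed the data with StandardScaler before PCA, but if you do not show its usage, there is no need to include the import in the example. Please provide the minimal working example of what you want to show. The same goes for any other imports, such as train_test_split or even import sklearn, since the other sklearn imports already provide the same behavior as import sklearn.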

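Is y_test an array of integers representing the classes?

Thanks! I had two horizontal lines on the y-axis (one per class), so to force them together I created a numpy.ones array and used it as the y-axis of the plot.

@redwytnblak You're welcome! For future reference, my code for coloring the classes, ['red' if i==0 else 'green' for i in y_labels], only works for two classes. If you are dealing with a multiclass problem, I suggest you use a colormap: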
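The snippet that followed this comment is not preserved here. As a minimal sketch of what colormap-based coloring can look like (the synthetic three-class data and the choice of the "viridis" colormap are assumptions, not the commenter's code):

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: three synthetic classes; y_labels holds integer class ids
rng = np.random.default_rng(0)
y_labels = np.repeat([0, 1, 2], 50)
values = rng.normal(loc=y_labels * 3.0, scale=1.0)

fig, ax = plt.subplots()
# c takes the class ids directly; cmap maps each id to a color
points = ax.scatter(range(len(y_labels)), values, c=y_labels, cmap="viridis")

# legend_elements() builds one legend entry per distinct value in c
handles, labels = points.legend_elements()
ax.legend(handles, labels, title="class")
# call plt.show() to display the figure interactively
```

Passing the integer labels through c with a cmap scales to any number of classes without writing a color per class by hand.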