Scikit-learn: concatenating the outputs of different steps in an sklearn Pipeline

scikit-learn

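I want to chain four steps in a single pipeline to build a supervised classifier:

(1) Reduce dimensionality with PCA, obtaining matrix_1 with s rows (samples) by c columns (components).

(2) Feed matrix_1 from (1) into KMeans for blind (unsupervised) separation, obtaining matrix_2 with s rows (samples) by 1 column (group label).

(3) Horizontally concatenate matrix_1 from (1) and matrix_2 from (2), obtaining matrix_3 with s rows (samples) by c+1 columns (c components plus 1 label).

(4) Feed matrix_3 from (3) into an MLPClassifier neural network.

So my pipeline would look like this: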

Pipeline(steps=[('step1', PCA()), ('step2', KMeans()), ('step3', myStep3(FastICA().components_, KMeans().labels_)), ('step4', MLPClassifier())])

My question now is how to implement "step 3" of the pipeline. Is there an sklearn function/class that I can use in place of myStep3()?
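For reference, the four steps can be written out by hand, outside a Pipeline, to make the target shape of matrix_3 concrete. This is only a sketch: the random data, n_components=3 and n_clusters=4 are assumptions chosen for illustration, not values from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# Made-up data: s = 100 samples, 5 raw features, binary target.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 5))
y = rng.randint(0, 2, size=100)

# (1) PCA: s rows by c columns (c = 3 here, an arbitrary choice).
matrix_1 = PCA(n_components=3).fit_transform(X)

# (2) KMeans on the PCA output: one cluster label per sample, as a column.
km = KMeans(n_clusters=4, n_init=10, random_state=0)
matrix_2 = km.fit_predict(matrix_1).reshape(-1, 1)

# (3) Horizontal concatenation: s rows by c + 1 columns.
matrix_3 = np.hstack([matrix_1, matrix_2])

# (4) Train the MLP on the concatenated matrix.
clf = MLPClassifier(max_iter=500, random_state=0)
clf.fit(matrix_3, y)

print(matrix_1.shape, matrix_2.shape, matrix_3.shape)
```

The drawback of this manual version is that the fitted PCA and KMeans objects have to be kept and re-applied by hand at prediction time, which is what a Pipeline would otherwise take care of.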

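One approach is to duplicate the feature columns and apply a separate transformer to each copy. The first set gets only the PCA transform, and the second set gets PCA followed by KMeans. This can be done with ColumnTransformer, which assigns different transformations to different sets of columns. On a dummy example, it looks like this: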

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Generate dummy data: two features "a" and "b", and a binary target "y"
df = pd.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "y": np.random.choice(2, 100)})

# Duplicate columns ["a", "b"] into ["a_new", "b_new"]
cols = ["a", "b"]
new_cols = [col_name + "_new" for col_name in cols]
df[new_cols] = df[cols]

# "cols" receive only the PCA
# "new_cols" receive a Pipeline made of PCA followed by KMeans
CT = ColumnTransformer([("onlyPCA", PCA(), cols),
                        ("PCA+KMeans", Pipeline([("PCA", PCA()), ("KMeans", KMeans())]), new_cols)])

# Wrap the whole thing into a Pipeline ending with the classifier
pipe = Pipeline([("transformer", CT), ("classifier", MLPClassifier())])

pipe.fit(df[cols + new_cols], df.y)

Note that you also need to pass the duplicated columns at prediction time:

pipe.predict(df[cols+new_cols])

MaximeKan's answer above is great. I approached it differently, using FeatureUnion, which avoids duplicating the columns:

Pipeline(steps=[('ftrUn', FeatureUnion([('myDr', PCA()),('myDrKm', Pipeline([('myDr', PCA()),('myKM', KMeans())]))])),('myNN', MLPClassifier())])
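A runnable version of that one-liner might look like the sketch below. The step names come from the line above; the random data and the n_components, n_clusters, n_init, max_iter and random_state values are assumptions added so the example is reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

# Made-up data: 100 samples, 5 raw features, binary target.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 5))
y = rng.randint(0, 2, size=100)

# Both branches receive the same X, so no column duplication is needed.
union = FeatureUnion([
    ("myDr", PCA(n_components=3)),
    ("myDrKm", Pipeline([
        ("myDr", PCA(n_components=3)),
        ("myKM", KMeans(n_clusters=4, n_init=10, random_state=0)),
    ])),
])

pipe = Pipeline([
    ("ftrUn", union),
    ("myNN", MLPClassifier(max_iter=500, random_state=0)),
])
pipe.fit(X, y)

# Inspect what the classifier actually receives.
features = pipe.named_steps["ftrUn"].transform(X)
print(features.shape)  # 3 PCA columns plus one column per KMeans cluster
```

FeatureUnion fits each branch on the full input and stacks the outputs side by side, so predict can be called directly on the raw X.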

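The only problem here (which also applies to MaximeKan's answer above) is the output of KMeans(). Inside a pipeline its output is the distance from each sample to each cluster center, not the cluster labels I originally asked for.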
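The difference is easy to see by calling both methods on a fitted KMeans. A small sketch on made-up data (the array size and n_clusters=4 are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 100 samples, 3 features.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 3))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# transform(): one column per cluster, holding distances to each center.
distances = km.transform(X)

# predict(): a single cluster index per sample.
labels = km.predict(X)

print(distances.shape)  # (n_samples, n_clusters)
print(labels.shape)     # (n_samples,)

# Barring exact ties, each label is the index of the nearest center.
nearest = distances.argmin(axis=1)
```

A Pipeline or FeatureUnion only ever calls transform() on intermediate steps, which is why the distances, and not the labels, end up in the feature matrix.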

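To get cluster labels, how about a custom class that inherits from KMeans but overrides transform(self, X) as return self.predict(X)? You could also use a stacking ensemble, but then you would get some cross-validation that you may not want or need.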
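A minimal sketch of the subclass idea follows. The class name KMeansLabels, the data and the hyperparameters are hypothetical. Two details go beyond the suggestion as worded: the labels are reshaped into a column because FeatureUnion stacks 2-D outputs, and fit_transform is overridden as well so that fitting also yields labels rather than distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class KMeansLabels(KMeans):
    """Hypothetical helper: KMeans whose transform() returns cluster labels."""

    def transform(self, X):
        # Return the labels as an (n_samples, 1) column instead of distances.
        return self.predict(X).reshape(-1, 1)

    def fit_transform(self, X, y=None):
        # Override too, so the fitting path goes through the transform() above.
        return self.fit(X).transform(X)


# Made-up data: 100 samples, 5 raw features, binary target.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 5))
y = rng.randint(0, 2, size=100)

pipe = Pipeline([
    ("ftrUn", FeatureUnion([
        ("myDr", PCA(n_components=3)),
        ("myDrKm", Pipeline([
            ("myDr", PCA(n_components=3)),
            ("myKM", KMeansLabels(n_clusters=4, n_init=10, random_state=0)),
        ])),
    ])),
    ("myNN", MLPClassifier(max_iter=500, random_state=0)),
])
pipe.fit(X, y)

features = pipe.named_steps["ftrUn"].transform(X)
print(features.shape)  # c + 1 columns: 3 components plus 1 label column
```

Note that the label column is fed to the MLP as a plain number, so cluster 3 looks "larger" than cluster 0 even though the numbering is arbitrary. Whether that matters depends on the problem; one-hot encoding the label would be an alternative.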