Custom pipeline for different data types in Python scikit-learn


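I am currently trying to predict whether a Kickstarter project will succeed, based on a set of integer features and some text features. I am thinking of building a pipeline that looks something like this

Reference:

Here is my ItemSelector and pipeline code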

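from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    # Selects the column(s) named by `keys` from a DataFrame.
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        # Nothing to learn: the selection is stateless.
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]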
I have verified that ItemSelector works as expected

t = ItemSelector(['cleaned_text'])
t.transform(df)

And it extracts the necessary columns
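To make the shape of that output concrete, here is a small sketch on toy data. The `goal` column and the sample rows are made up for illustration, and passing a plain string key is an assumption added here, not something the question does. It shows that a list key yields a two-dimensional DataFrame while a string key yields a one-dimensional Series.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column (string key) or several columns (list of keys)."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]


# Toy data: only 'cleaned_text' comes from the question, the rest is hypothetical.
df = pd.DataFrame({
    "cleaned_text": ["a board game", "a smart watch", "an indie film"],
    "goal": [1000, 50000, 20000],
})

as_frame = ItemSelector(["cleaned_text"]).transform(df)  # list key -> DataFrame
as_series = ItemSelector("cleaned_text").transform(df)   # string key -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

Which of the two shapes is appropriate depends on the next step in the branch: numeric estimators generally want the two-dimensional form, while text vectorizers iterate over their input one document at a time.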
But when I run pipeline.fit(X_train, y_train) on my pipeline, I get the error below. Any idea how to fix it?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-317e1c402966> in <module>()
----> 1 pipeline.fit(X_train, y_train)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    740         self._update_transformer_list(transformers)
    741         if any(sparse.issparse(f) for f in Xs):
--> 742             Xs = sparse.hstack(Xs).tocsr()
    743         else:
    744             Xs = np.hstack(Xs)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    456 
    457     """
--> 458     return bmat([blocks], format=format, dtype=dtype)
    459 
    460 

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    577                                                     exp=brow_lengths[i],
    578                                                     got=A.shape[0]))
--> 579                     raise ValueError(msg)
    580 
    581                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 81096, expected 1.

ItemSelector returns a DataFrame, not an array. That is why scipy.sparse.hstack throws the error. Change ItemSelector as follows:

class ItemSelector(BaseEstimator, TransformerMixin):
    ....
    ....
    ....

    def transform(self, data_dict):
        # .values returns a NumPy array; as_matrix() did the same but was removed in pandas 1.0
        return data_dict[self.keys].values
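The error occurs in the integer_features part of the pipeline. In the first part, text, the transformers that follow ItemSelector support DataFrames, so the data is correctly converted to an array. The second part, however, contains only ItemSelector and returns a DataFrame.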
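The ValueError in the traceback is raised by scipy.sparse.hstack when the blocks it receives do not have the same number of rows (81096 versus 1 in the question). A minimal sketch that reproduces this kind of failure outside of any pipeline follows; the shapes and the helper name `try_hstack` are made up for illustration.

```python
import numpy as np
from scipy import sparse


def try_hstack(blocks):
    """Return None if the blocks stack horizontally, else the ValueError message."""
    try:
        sparse.hstack(blocks)
        return None
    except ValueError as exc:
        return str(exc)


# Hypothetical feature blocks: one branch produced 1 row, the others 4 rows.
one_row = sparse.csr_matrix(np.ones((1, 3)))
four_rows = sparse.csr_matrix(np.ones((4, 2)))
also_four_rows = sparse.csr_matrix(np.ones((4, 5)))

mismatch_error = try_hstack([one_row, four_rows])      # row counts differ
match_error = try_hstack([also_four_rows, four_rows])  # row counts agree

print(mismatch_error)
print(match_error)
```

Checking the row count of each branch's output separately, before combining them, is usually the quickest way to find which branch is off.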

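Update

In the comments, you mentioned that you want to perform some operations on the DataFrame returned by ItemSelector. So instead of modifying ItemSelector's transform method, you can create a new transformer and append it to the second part of the pipeline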

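class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    # No __init__ is needed: the transformer has no parameters.
    def fit(self, x, y=None):
        return self

    def transform(self, X):
        # .values returns a NumPy array (as_matrix() was removed in pandas 1.0)
        return X.values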
Then your pipeline should look like this:

pipeline = Pipeline([
    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for pulling features from the post's subject line
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),

            # Pipeline for pulling ad hoc features from post's body
            ('integer', Pipeline([
                ('integer_features', ItemSelector(int_features)),
                ('array', DataFrameToArrayTransformer()),
            ])),
        ]
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

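The main thing to understand here is that FeatureUnion only handles two-dimensional arrays when combining its branches, so any other type, such as a DataFrame, may cause problems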
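For reference, here is a self-contained sketch of such a pipeline that runs end to end on toy data. The sample rows, the labels, and the `goal` and `backers` column names are all hypothetical. One deliberate difference, which is an assumption of this sketch and not something confirmed in the thread: the text branch selects `'cleaned_text'` with a string key, so CountVectorizer receives a one-dimensional Series of documents and not a one-column DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]


class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return X.values


# Toy data: everything except the 'cleaned_text' name is made up.
df = pd.DataFrame({
    "cleaned_text": ["board game for kids", "smart watch with gps",
                     "indie film about jazz", "card game expansion"],
    "goal": [1000, 50000, 20000, 1500],
    "backers": [120, 30, 45, 200],
})
y = [1, 0, 0, 1]
int_features = ["goal", "backers"]

union = FeatureUnion(transformer_list=[
    # Text branch: string key -> Series of documents -> sparse tf-idf matrix
    ("text", Pipeline([
        ("selector", ItemSelector("cleaned_text")),
        ("counts", CountVectorizer()),
        ("tf_idf", TfidfTransformer()),
    ])),
    # Integer branch: list key -> DataFrame -> 2-D NumPy array
    ("integer", Pipeline([
        ("integer_features", ItemSelector(int_features)),
        ("array", DataFrameToArrayTransformer()),
    ])),
])
pipeline = Pipeline([("union", union), ("svc", SVC(kernel="linear"))])

features = union.fit_transform(df)  # both branches have one row per sample
pipeline.fit(df, y)
predictions = pipeline.predict(df)
print(features.shape, predictions)
```

Inspecting `features.shape` before fitting the classifier confirms that both branches produced one row per sample, which is the condition FeatureUnion needs.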

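You should post the full stack trace of the error. You can also use TfidfVectorizer on its own in place of CountVectorizer followed by TfidfTransformer.

One more thing: make sure the data returned by ItemSelector is two-dimensional, with shape (n_samples, n_features).

Can you post some sample data that reproduces the error? Also, what is the output shape of the integer_features ItemSelector?

That seems to be the problem. These are the shapes before the train/test split: (108129, 7).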
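On the TfidfVectorizer suggestion above: with default settings it is documented as equivalent to CountVectorizer followed by TfidfTransformer. The following sketch checks that on a few made-up documents.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Hypothetical documents, for illustration only.
docs = [
    "board game for kids",
    "smart watch with gps",
    "indie film about jazz",
]

# Two-step version, as used in the thread.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)

# One-step version suggested in the comment.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

Using the single vectorizer removes one step from the text branch without changing the features it produces.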
The pipeline tries to apply the lower method to the data passed in, and returning an array of data actually gives this error:

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
    205
    206         if self.lowercase:
--> 207             return lambda x: strip_accents(x.lower())
    208         else:
    209             return strip_accents

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
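A possible explanation for both errors, offered as an assumption and not as something the thread confirms: CountVectorizer iterates over its input and treats each item as one document. Iterating over a DataFrame yields its column labels, so a one-column DataFrame looks like a single document. Iterating over a two-dimensional array yields row arrays, which have no `lower` method. The sketch below, on made-up documents, compares the three input types.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; only the column name comes from the question.
df = pd.DataFrame({
    "cleaned_text": ["board game for kids", "smart watch with gps",
                     "indie film about jazz"],
})

# 1-D Series: one document per sample.
from_series = CountVectorizer().fit_transform(df["cleaned_text"])

# One-column DataFrame: iteration yields the column label, so one "document".
from_frame = CountVectorizer().fit_transform(df[["cleaned_text"]])

# 2-D array: each item is a row array, which cannot be lowercased.
try:
    CountVectorizer().fit_transform(df[["cleaned_text"]].values)
    array_error = None
except AttributeError as exc:
    array_error = str(exc)

print(from_series.shape)
print(from_frame.shape)
print(array_error)
```

If this is what happened, the single row from the DataFrame case would match the "expected 1" in the original ValueError, and selecting the text column with a string key, such as `ItemSelector('cleaned_text')`, would give the vectorizer the one-dimensional input it expects.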
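I also ran into problems combining DataFrames and scikit-learn. However, as you can see, it is not hard to make them work together. In particular, you will find a DataFrameFeatureUnion transformer there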