Custom pipeline for different data types in Python scikit-learn

python pandas numpy machine-learning scikit-learn

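I am currently trying to predict whether a Kickstarter project will succeed, based on a set of integer features and some text features. I am thinking of building a pipeline that looks like this

Reference:

Here is my ItemSelector and pipeline code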

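from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, data_dict):
        # A list of keys returns a DataFrame; a single string key returns a Series
        return data_dict[self.keys]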

I have verified that ItemSelector works as expected

t = ItemSelector(['cleaned_text'])
t.transform(df)

And it extracts the necessary columns

But when I run the pipeline with pipeline.fit(X_train, y_train), I get this error. Do you know how to fix it?

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-317e1c402966> in <module>()
----> 1 pipeline.fit(X_train, y_train)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    740         self._update_transformer_list(transformers)
    741         if any(sparse.issparse(f) for f in Xs):
--> 742             Xs = sparse.hstack(Xs).tocsr()
    743         else:
    744             Xs = np.hstack(Xs)

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    456 
    457     """
--> 458     return bmat([blocks], format=format, dtype=dtype)
    459 
    460 

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    577                                                     exp=brow_lengths[i],
    578                                                     got=A.shape[0]))
--> 579                     raise ValueError(msg)
    580 
    581                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 81096, expected 1.

ItemSelector returns a DataFrame, not an array. That is why scipy.sparse.hstack throws the error. Change ItemSelector as follows:

class ItemSelector(BaseEstimator, TransformerMixin):
    # ... __init__ and fit unchanged ...

    def transform(self, data_dict):
        # .values returns a NumPy array; as_matrix() was removed in pandas 1.0
        return data_dict[self.keys].values

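The error occurs in the integer_features part of the pipeline. For the first part, text, the transformers below ItemSelector support DataFrames, so the data is converted to an array correctly. But the second part contains only ItemSelector and returns a DataFrame.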
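The row counts in the error message can be checked branch by branch. The sketch below uses a made-up two-row DataFrame (the `goal` column and all values are assumptions for illustration only) and compares what `CountVectorizer` sees when a column is selected with a list key versus a string key. Iterating over a DataFrame yields its column labels rather than its rows, so a one-column DataFrame is treated as a single document. That would be consistent with the "expected 1" in the traceback, although the asker's actual data may behave differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data; column values are illustrative assumptions.
df = pd.DataFrame({
    "cleaned_text": ["fund my board game", "smart watch for dogs"],
    "goal": [1000, 25000],
})

as_frame = df[["cleaned_text"]]   # list key -> DataFrame, shape (2, 1)
as_series = df["cleaned_text"]    # string key -> Series, shape (2,)

# Iterating a DataFrame yields column labels, so only one "document" is seen.
counts_from_frame = CountVectorizer().fit_transform(as_frame)
# Iterating a Series yields the strings themselves, one document per row.
counts_from_series = CountVectorizer().fit_transform(as_series)

print(counts_from_frame.shape[0])   # rows produced from the DataFrame
print(counts_from_series.shape[0])  # rows produced from the Series
```

If the text branch turns out to be the one producing a single row, passing a string key such as `ItemSelector('cleaned_text')` for that branch would hand `CountVectorizer` a Series instead.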

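Update:

In the comments, you mentioned that you want to perform some operations on the DataFrame returned by ItemSelector. So, instead of modifying ItemSelector's transform method, you can create a new transformer and append it to the second part of the pipeline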

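class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    # No __init__ needed: the transformer has no parameters

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        # .values returns a NumPy array; as_matrix() was removed in pandas 1.0
        return X.values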

Then your pipeline should look like this:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    # Use FeatureUnion to combine the text and integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for extracting TF-IDF features from the text column
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),

            # Pipeline for passing the integer columns through as an array
            ('integer', Pipeline([
                ('integer_features', ItemSelector(int_features)),
                ('array', DataFrameToArrayTransformer()),
            ])),
        ]
    )),

    # Use a SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])

The main thing to understand here is that FeatureUnion only handles 2-D arrays when combining results, so any other type (such as a DataFrame) may cause problems
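To make the 2-D requirement concrete, here is a minimal sketch on toy data. The column names `goal` and `backers` and all values are assumptions for illustration, and the text branch uses a string key so that `CountVectorizer` receives a Series of strings, which differs from the list key used in the pipeline above. Each branch produces a 2-D block with the same number of rows, and FeatureUnion stacks the blocks side by side.

```python
import pandas as pd
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column (string key) or several columns (list key)."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.keys]


class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    """Convert a DataFrame to a 2-D NumPy array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.values


# Toy data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "cleaned_text": ["fund my board game", "smart watch for dogs", "indie film"],
    "goal": [1000, 25000, 5000],
    "backers": [12, 340, 57],
})

union = FeatureUnion([
    ("text", Pipeline([
        ("selector", ItemSelector("cleaned_text")),        # Series of strings
        ("counts", CountVectorizer()),                      # sparse, 3 rows
    ])),
    ("integer", Pipeline([
        ("selector", ItemSelector(["goal", "backers"])),   # DataFrame, shape (3, 2)
        ("array", DataFrameToArrayTransformer()),           # ndarray, shape (3, 2)
    ])),
])

combined = union.fit_transform(df)
print(sparse.issparse(combined), combined.shape)
```

Because one block is sparse, the combined result is sparse as well; its column count is the vocabulary size plus the two integer columns.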

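You should post the full stack trace of the error. Also, TfidfVectorizer can be used on its own in place of CountVectorizer followed by TfidfTransformer. One more thing: make sure the data returned by ItemSelector is 2-D, with shape (n_samples, n_features).

Can you post some sample data that reproduces the error? Also, what is the output shape of the integer_features ItemSelector? That seems to be the problem.

These are the shapes before the train-test split: (108129, 7).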
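The suggestion to use TfidfVectorizer alone can be checked with a short sketch. The three documents below are made up for illustration; with default settings, both routes should yield the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy documents; illustrative assumptions only.
docs = ["fund my board game", "smart watch for dogs", "board game for dogs"]

# Two-step route: raw counts first, then TF-IDF weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(docs)
# One-step route: TfidfVectorizer does both in a single transformer.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Using the single transformer shortens the text branch of the pipeline by one step without changing the features.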

Returning an array actually gives this error, because the pipeline tries to apply the lower method to the data passed in:

~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
    205
    206         if self.lowercase:
--> 207             return lambda x: strip_accents(x.lower())
    208         else:
    209             return strip_accents

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
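This AttributeError can be reproduced on toy data. The sketch below assumes the text column reached `CountVectorizer` as a 2-D array of shape (n_samples, 1), as the `.values` of a one-column DataFrame would be; the strings are made up. Each "document" is then a one-element ndarray rather than a string, so `.lower()` fails. Flattening to 1-D is one possible workaround.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy 2-D array of shape (2, 1); values are illustrative assumptions.
texts_2d = np.array(
    [["fund my board game"], ["smart watch for dogs"]], dtype=object
)

try:
    CountVectorizer().fit_transform(texts_2d)
    error_name = None
except AttributeError as exc:
    # Each "document" is a 1-element ndarray, which has no .lower() method.
    error_name = type(exc).__name__

# Flattening to 1-D gives the vectorizer one string per document.
counts = CountVectorizer().fit_transform(texts_2d.ravel())
print(error_name, counts.shape)
```

The text branch therefore needs 1-D input of strings, while the numeric branch needs a 2-D array, so the two branches may need different conversions.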

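I also ran into problems using DataFrames with scikit-learn. However, as you can see, it is not hard to make them work together. In particular, you will find a DataFrameFeatureUnion transformer there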
