Custom word2vec transformer on a pandas DataFrame, used in a FeatureUnion


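For the DataFrame df below, I want to one-hot encode the type column and use the dictionary word2vec to convert the word column into its vector representation. I then want to concatenate the two transformed outputs with the count column to form the final features for classification.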

>>> df
       word type  count
0     apple    A      4
1       cat    B      3
2  mountain    C      1 

>>> df.dtypes
word       object
type     category
count       int64

>>> word2vec
{'apple': [0.1, -0.2, 0.3], 'cat': [0.2, 0.2, 0.3], 'mountain': [0.4, -0.2, 0.3]}

I defined a custom Transformer and used FeatureUnion to concatenate the features:

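import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder

# TypeSelector and StringIndexer (used below) are not part of scikit-learn;
# they are custom transformers whose definitions are not shown here.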

class w2vTransformer(TransformerMixin):

    def __init__(self,word2vec):
        self.word2vec = word2vec

    def fit(self,x, y=None):
        return self

    def wv(self, w):
        return self.word2vec[w] if w in self.word2vec else [0, 0, 0]

    def transform(self, X, y=None):
         return df['word'].apply(self.wv)

pipeline = Pipeline([
    ('features', FeatureUnion(transformer_list=[
        # Part 1: get integer column
        ('numericals', Pipeline([
            ('selector', TypeSelector(np.number)),
        ])),

        # Part 2: get category column and its onehotencoding
        ('categoricals', Pipeline([
            ('selector', TypeSelector('category')),
            ('labeler', StringIndexer()),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ])), 

        # Part 3: transform word to its embedding
        ('word2vec', Pipeline([
            ('w2v', w2vTransformer(word2vec)),
        ]))
    ])),
])

When I run pipeline.fit_transform(df), I get an error:

blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 1, expected 3.

However, if I remove the word2vec transformer (Part 3) from the pipeline, the remaining pipeline (Part 1 + Part 2) works fine:

>>> pipeline_no_word2vec.fit_transform(df).todense()
matrix([[4., 1., 0., 0.],
        [3., 0., 1., 0.],
        [1., 0., 0., 1.]])

If I keep only the w2v transformer in the pipeline, it also works:

>>> pipeline_only_word2vec.fit_transform(df)
array([list([0.1, -0.2, 0.3]), list([0.2, 0.2, 0.3]),
       list([0.4, -0.2, 0.3])], dtype=object)

I suspect something is wrong with my w2vTransformer class, but I don't know how to fix it. Please help.

This error occurs because FeatureUnion expects a 2-D array from each of its parts.

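The first two parts of your FeatureUnion, 'numericals' and 'categoricals', correctly return 2-D data of shape (n_samples, n_features).

n_samples is the number of samples in your example data, which is 3. n_features depends on the individual part: OneHotEncoder changes it in the second part, while in the first part it is 1.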

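But the third part, 'word2vec', returns a pandas.Series object, which has the 1-D shape (3,). FeatureUnion treats this as shape (1, 3) by default and therefore complains that it does not match the other blocks.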
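The shape promotion described above can be reproduced in isolation. This is a minimal sketch, assuming that FeatureUnion stacks its outputs with scipy.sparse.hstack whenever one of them is sparse (as the OneHotEncoder output is here); the array values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# A 1-D array standing in for the Series returned by the 'word2vec' part.
one_d = np.array([0.1, 0.2, 0.4])

# When converted to a sparse matrix, a 1-D input becomes a single ROW.
row_shape = sparse.coo_matrix(one_d).shape

# Reshaped to a column, the row count matches the other blocks again.
col_shape = sparse.coo_matrix(one_d.reshape(-1, 1)).shape

print(one_d.shape)  # (3,)
print(row_shape)    # (1, 3)
print(col_shape)    # (3, 1)
```

The single row is what produces "shape[0] == 1, expected 3" in the error message.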

So you need to fix this shape.

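Even if you simply call reshape() at the end to change the shape to (3, 1), your code still won't run, because the elements of that array are the lists from the word2vec dict, and they are not converted into a proper 2-D array. You end up with an array of lists instead.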
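To make the difference concrete, here is a small sketch comparing the reshaped array of lists with a properly stacked 2-D array. It reuses the word2vec dict from the question; the variable names are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

word2vec = {
    "apple": [0.1, -0.2, 0.3],
    "cat": [0.2, 0.2, 0.3],
    "mountain": [0.4, -0.2, 0.3],
}
words = pd.Series(["apple", "cat", "mountain"])

# Each element of the resulting Series is a Python list.
as_series = words.apply(lambda w: word2vec.get(w, [0, 0, 0]))

# reshape() only changes the outer shape; the cells still hold lists.
reshaped = as_series.to_numpy().reshape(-1, 1)

# Building the array from the nested lists gives a real numeric 2-D array.
stacked = np.array(as_series.tolist(), dtype=float)

print(as_series.shape)                 # (3,)
print(reshaped.shape, reshaped.dtype)  # (3, 1) object
print(stacked.shape, stacked.dtype)    # (3, 3) float64
```

Only the last form has one numeric column per embedding dimension, which is what the other FeatureUnion blocks can be stacked with.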

Change the w2vTransformer as follows to correct the error:

class w2vTransformer(TransformerMixin):
    ...
    ...
    def transform(self, X, y=None):
        return np.array([np.array(vv) for vv in X['word'].apply(self.wv)])

After that, the pipeline will start working.
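For reference, here is a self-contained sketch of the whole pipeline with the corrected transformer. TypeSelector and StringIndexer are not shown in the question, so this sketch swaps in a hypothetical ColumnSelector that picks columns by name and feeds the string categories straight to OneHotEncoder. The W2VTransformer name and its dim parameter are likewise assumptions made for this example, not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TypeSelector: selects columns by name."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Always return a 2-D NumPy array of shape (n_samples, n_columns).
        return X[self.columns].to_numpy()


class W2VTransformer(BaseEstimator, TransformerMixin):
    """Maps the 'word' column to a 2-D array of embedding vectors."""

    def __init__(self, word2vec, dim=3):
        self.word2vec = word2vec
        self.dim = dim  # assumed embedding size, used for unknown words

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        zero = [0.0] * self.dim
        # Use the X passed in (not a global df) and stack into (n_samples, dim).
        return np.array([self.word2vec.get(w, zero) for w in X["word"]], dtype=float)


word2vec = {
    "apple": [0.1, -0.2, 0.3],
    "cat": [0.2, 0.2, 0.3],
    "mountain": [0.4, -0.2, 0.3],
}
df = pd.DataFrame({
    "word": ["apple", "cat", "mountain"],
    "type": pd.Categorical(["A", "B", "C"]),
    "count": [4, 3, 1],
})

pipeline = Pipeline([
    ("features", FeatureUnion(transformer_list=[
        ("numericals", ColumnSelector(["count"])),
        ("categoricals", Pipeline([
            ("selector", ColumnSelector(["type"])),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ])),
        ("word2vec", W2VTransformer(word2vec)),
    ])),
])

# The one-hot block is sparse, so the union is sparse; densify for display.
features = pipeline.fit_transform(df).toarray()
print(features.shape)  # (3, 7)
```

Each row now holds the count, the three one-hot columns, and the three embedding values, in that order.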