Python: Using FeatureUnion with custom TransformerMixin transformers in scikit-learn

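I am writing custom transformers in scikit-learn to perform specific operations on an array. To do so, I inherit from the TransformerMixin class. It works fine when I use a single transformer. However, when I try to chain several of them with FeatureUnion (or make_union), the array is replicated n times. What can I do to avoid this? Am I using scikit-learn the way it is meant to be used?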

import numpy as np
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion

# creation of array
s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])
print('base array: \n', X, '\n')

# A fake example that appends a column (Could be a score, ...) calculated on specific columns from X
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    def transform(self, X):
        # appends a column (in this case, a constant) to X
        s = np.full(X.shape[0], self.value)
        X = np.column_stack([X, s])
        return X

# as such, the transformer gives what I need first
transfo = DummyTransformer(value=1)
print('single transformer: \n', transfo.fit_transform(X), '\n')

# but when I try to chain them and create a pipeline I run into the replication of existing columns
stages = []
for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1),transfo))
pipeunion = FeatureUnion(stages)
print('Given result of the Feature union pipeline: \n', pipeunion.fit_transform(X), '\n')
# columns 1&2 from X are replicated

# I would expect:
expected = np.column_stack([X, np.full(X.shape[0], 1), np.full(X.shape[0], 2) ])
print('Expected result of the Feature Union pipeline: \n', expected, '\n')
Output:

base array: 
 [['foo' 'a']
 ['bar' 'b']
 ['baz' 'c']] 

single transformer: 
 [['foo' 'a' '1']
 ['bar' 'b' '1']
 ['baz' 'c' '1']] 

Given result of the Feature union pipeline: 
 [['foo' 'a' '1' 'foo' 'a' '2']
 ['bar' 'b' '1' 'bar' 'b' '2']
 ['baz' 'c' '1' 'baz' 'c' '2']] 

Expected result of the Feature Union pipeline: 
 [['foo' 'a' '1' '2']
 ['bar' 'b' '1' '2']
 ['baz' 'c' '1' '2']] 
Many thanks.

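FeatureUnion simply concatenates whatever it receives from its inner transformers. In your code, every inner transformer returns the same original columns along with its new column, so those columns appear once per transformer. It is up to each transformer to pass on only the data it should.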
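
To make the concatenation behaviour concrete, here is a minimal sketch that is not part of the original answer. It assumes a small numeric array, X_num, chosen only for illustration, and uses FunctionTransformer with no function, which returns its input unchanged. Since each branch returns the full input, the union's output contains the input columns once per branch:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Illustrative numeric data (an assumption; the question uses string columns)
X_num = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

# FunctionTransformer() with func=None returns its input unchanged
union = FeatureUnion([
    ("copy1", FunctionTransformer()),
    ("copy2", FunctionTransformer()),
])

# The outputs of both branches are stacked side by side
duplicated = union.fit_transform(X_num)
print(duplicated.shape)  # (3, 4): the two input columns, once per branch
```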

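I suggest returning only the new data from the inner transformers, and then concatenating the remaining (original) columns either outside or inside the FeatureUnion.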

For example, you can do it like this:

# This does nothing; it just passes the data through unchanged
class DataPasser(TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# Your transformer
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value

    def fit(self, *_):
        return self

    # Changed this to only return new column after some operation on X
    def transform(self, X):
        s = np.full(X.shape[0], self.value)
        return s.reshape(-1,1)
After this, further down in your code, change the following:

stages = []    

# Append our DataPasser here, so original data is at the beginning
stages.append(('no_change', DataPasser()))


for i in range(2):
    transfo = DummyTransformer(value=i+1)
    stages.append(('step'+str(i+1),transfo))

pipeunion = FeatureUnion(stages)
Running this new code produces the following result:

('Given result of the Feature union pipeline: \n', 
array([['foo', 'a', '1', '2'],
       ['bar', 'b', '1', '2'],
       ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
('Expected result of the Feature Union pipeline: \n', 
array([['foo', 'a', '1', '2'],
       ['bar', 'b', '1', '2'],
       ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
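
The other option mentioned above, concatenating outside the FeatureUnion, could look roughly like the sketch below. ConstantColumn is a hypothetical stand-in for DummyTransformer that returns only its new column, and X_num is an illustrative numeric array rather than the string array from the question; both names are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ConstantColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that returns only one new constant column."""

    def __init__(self, value=0.0):
        self.value = value

    def fit(self, X, y=None):
        # Nothing to learn for a constant column
        return self

    def transform(self, X):
        # One column with as many rows as X, filled with the constant
        return np.full((X.shape[0], 1), self.value, dtype=float)


# Illustrative numeric data (an assumption, not the data from the question)
X_num = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])

# The union produces only the new columns
union = FeatureUnion([
    ("step1", ConstantColumn(value=1.0)),
    ("step2", ConstantColumn(value=2.0)),
])
new_cols = union.fit_transform(X_num)

# Concatenate the original columns with the new ones outside the union
result = np.hstack([X_num, new_cols])
print(result)
```

With this variant the union is responsible only for the new features, whereas the DataPasser approach keeps everything inside a single FeatureUnion.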

Thanks a lot. I was confused by the behaviour of Spark ML transformers, which output the whole input data plus the new data.