Pipeline problem in Python sklearn


I'm a beginner. I'm using a Pipeline to combine a vectorizer and a classifier for a text mining problem. Here is my code:

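from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


def create_ngram_model():
    # TF-IDF over word 1- to 3-grams, followed by Gaussian naive Bayes.
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = GaussianNB()
    pipeline = Pipeline([('vect', tfidf_ngrams), ('clf', clf)])
    return pipeline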


def get_trains():
    # Skip the CSV header row; column 0 is the label, column 2 the text.
    data = open('../cleaning data/cleaning the sentences/cleaned_comments.csv', 'r').readlines()[1:]
    features_train = []
    labels_train = []
    for line in data:
        l = line.split(',')
        labels_train.append(int(l[0]))
        features_train.append(l[2])
    return features_train, labels_train


def train_model(clf_factory, features_train, labels_train):
    # Hold out 10% of the data for testing.
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
        features_train, labels_train, test_size=0.1, random_state=42)
    clf = clf_factory()
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    # accuracy_score expects (y_true, y_pred).
    return accuracy_score(labels_test, pred)


X, Y = get_trains()
print train_model(create_ngram_model, X, Y)

The features returned by get_trains are strings. I get this error:

clf.fit(features_train,labels_train)
  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 149, in fit
    X, y = check_arrays(X, y, sparse_format='dense')
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 263, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
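I have run into this error many times. Before, I simply changed the features to features_transformed.toarray(), but I can't do that here because the pipeline passes the transformed features along automatically. I also tried writing a new class that returns features_transformed.toarray(), but it raised the same error. I have searched a lot and found nothing. Please help.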

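There are two options:

Use a classifier that is compatible with sparse data. For example, the documentation names classifiers that support sparse input to fit.

Add a densifier to the pipeline. Apparently you got yours wrong; this one worked for me when I needed to densify my sparse data along the way: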

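class Densifier(object):
    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self
    def fit_transform(self, X, y=None):
        return self.transform(X)
    def transform(self, X, y=None):
        # Convert the scipy sparse matrix to a dense numpy array.
        return X.toarray()
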
Make sure to put it in the pipeline right before the classifier.
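The two options above can be sketched side by side. This is a minimal sketch, not the answerer's exact code: it assumes Python 3 and a recent scikit-learn rather than the Python 2.7 setup in the question, it picks MultinomialNB as one classifier that accepts sparse input, and the helper names create_sparse_model and create_dense_model as well as the toy data are made up for illustration.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline


class Densifier(TransformerMixin, BaseEstimator):
    """Convert a scipy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.toarray()


def create_sparse_model():
    # Option 1 (hypothetical helper): MultinomialNB takes the sparse
    # TF-IDF matrix directly, so no densifying step is needed.
    return Pipeline([
        ("vect", TfidfVectorizer(ngram_range=(1, 3), analyzer="word")),
        ("clf", MultinomialNB()),
    ])


def create_dense_model():
    # Option 2 (hypothetical helper): densify right before GaussianNB.
    # Note that Densifier() is an instance, not the class itself.
    return Pipeline([
        ("vect", TfidfVectorizer(ngram_range=(1, 3), analyzer="word")),
        ("densify", Densifier()),
        ("clf", GaussianNB()),
    ])


# Toy data, invented purely for illustration.
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

sparse_model = create_sparse_model().fit(texts, labels)
dense_model = create_dense_model().fit(texts, labels)

sparse_pred = sparse_model.predict(["great movie"])
dense_pred = dense_model.predict(["bad film"])
```

Option 1 keeps memory use low because the matrix stays sparse throughout. Option 2 allocates a full dense array, which can get large with 1- to 3-gram features.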


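Hi, I changed the pipeline to pipeline = Pipeline([('vect', tfidf_ngrams), ('densify', Densifier), ('clf', clf)]). It still gives an error: Xt = transform.fit_transform(Xt, y, **fit_params_steps[name]) TypeError: unbound method fit_transform() must be called with Densifier instance as first argument (got csr_matrix instance instead)

@user2443048, which error? You have now broken the pipeline by passing the class instead of an instance. Replace Densifier with Densifier().

Now I get a memory error. I think I should use SelectPercentile to keep the best 10-20% of the features, but I don't really know how to couple it with the pipeline.

According to the documentation, SelectPercentile does not work with sparse data.

Ah, it worked. I just selected 5% of the features and added that step to the pipeline before the densifier. Thanks.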
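The arrangement described in the last comment, feature selection placed before the densifier, might look like the sketch below. It is an illustration under assumptions, not the commenter's actual code: it targets Python 3 and a recent scikit-learn, it uses chi2 as the scoring function (the thread does not say which one was used; chi2 is chosen here because, in current scikit-learn, it accepts sparse non-negative input), and create_selected_model and the toy data are hypothetical. The default percentile of 5 mirrors the comment, but the toy run uses 50 because the corpus is tiny.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class Densifier(TransformerMixin, BaseEstimator):
    """Convert a scipy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.toarray()


def create_selected_model(percentile=5):
    # Hypothetical helper: keep only the top-scoring features while the
    # matrix is still sparse, then densify the much smaller result.
    return Pipeline([
        ("vect", TfidfVectorizer(ngram_range=(1, 3), analyzer="word")),
        ("select", SelectPercentile(chi2, percentile=percentile)),
        ("densify", Densifier()),
        ("clf", GaussianNB()),
    ])


# Toy data, invented purely for illustration.
texts = [
    "good movie", "great film", "really good film",
    "bad movie", "awful film", "really bad film",
]
labels = [1, 1, 1, 0, 0, 0]

# percentile=50 only because this corpus has very few features.
model = create_selected_model(percentile=50).fit(texts, labels)

n_total = len(model.named_steps["vect"].vocabulary_)
n_selected = int(model.named_steps["select"].get_support().sum())
print(n_selected, "of", n_total, "features kept")
```

Selecting before densifying is what addresses the memory error: the dense array is built from the reduced feature set rather than from every n-gram in the vocabulary.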