Using CountVectorizer in a Python Mapper/Reducer

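I am trying to apply a tokenizer using Python mapper and reducer functions. I have the following code, but I keep getting an error. The reducer outputs the values as a list, and I pass those values to the vectorizer.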

from mrjob.job import MRJob
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer


class bagOfWords(MRJob):

    def mapper(self, _, line):
        cat, phrase, phraseid, sentiment = line.split(',')
        yield (cat, phraseid, sentiment), phrase

    def reducer(self, keys, values):
        yield keys, list(values)

    # Output: ["Train", "--", "2"] ["A series of escapades demonstrating the adage that    what is good for the goose", "A series", "A", "series"]

    def mapper(self, keys, values):
        vectorizer = CountVectorizer(min_df=0)
        vectorizer.fit(values)
        x = vectorizer.transform(values)
        x = x.toarray()
        yield keys, x


if __name__ == '__main__':
    bagOfWords.run()
ValueError: empty vocabulary; perhaps the documents only contain stop words


Any help is appreciated.

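CountVectorizer is stateful: you need to fit a single instance on the full dataset to build the vocabulary, so it is not well suited to parallel processing.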
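A minimal sketch of why this matters, assuming two made-up partitions of phrases (partition_a and partition_b are illustrative names, not part of the question's data): when each worker fits its own CountVectorizer, the same word can end up in a different column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical partitions, as two parallel workers might see them.
partition_a = ["a series of escapades", "good for the goose"]
partition_b = ["the goose is good", "another series"]

# Each worker fits its own instance, so each learns its own vocabulary.
vec_a = CountVectorizer().fit(partition_a)
vec_b = CountVectorizer().fit(partition_b)

print(sorted(vec_a.vocabulary_))
print(sorted(vec_b.vocabulary_))

# The column index assigned to the same word depends on the partition.
goose_col_a = vec_a.vocabulary_["goose"]
goose_col_b = vec_b.vocabulary_["goose"]
print(goose_col_a, goose_col_b)
```

Because the column layouts disagree, the count matrices produced by the two workers cannot simply be stacked together.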

Instead, you can use the stateless HashingVectorizer (no fitting is required; you can call transform directly).
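As a rough illustration of that suggestion, the sketch below transforms two made-up partitions independently and checks that the result matches transforming everything at once. The partition contents and the parameter choices (n_features=2**10, alternate_sign=False, norm=None) are assumptions made for the example, not values from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: the column for a token comes from a hash function,
# so there is no vocabulary to learn and no fit step.
vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)

# Hypothetical partitions, as two parallel workers might see them.
partition_a = ["a series of escapades", "good for the goose"]
partition_b = ["the goose is good", "another series"]

# Each partition can be transformed on its own.
x_a = vectorizer.transform(partition_a)
x_b = vectorizer.transform(partition_b)

# Transforming all documents in one go gives the same rows.
x_all = vectorizer.transform(partition_a + partition_b)
same = np.array_equal(np.vstack([x_a.toarray(), x_b.toarray()]), x_all.toarray())
print(x_a.shape, x_b.shape, same)
```

The trade-off is that hashed columns cannot be mapped back to the original words, and distinct words may occasionally collide in the same column.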