Python N-grams to array


python machine-learning scikit-learn


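For my thesis I am working on a machine learning project in Python that involves extracting features from text. As a first step, I tried to implement bigrams with scikit-learn.

Now, when I run my data through `CountVectorizer`, I get an array that contains only 1s, and sometimes a few larger values. For example: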

`[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]`
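A row of mostly 1s is what one would expect when the vectorizer is fitted on a single document: its vocabulary then contains only the bigrams found in that document, and most of them occur once. A minimal sketch, assuming a made-up sentence in place of a real plot (the text below is hypothetical, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for one movie plot (row[8] in the question).
plot = "the young wizard goes to school and the young wizard makes friends"

# Word-level bigrams only, same settings as in the question.
vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="word")

# Fitting on a one-element list builds the vocabulary from this plot alone,
# so every column corresponds to a bigram that occurs at least once.
counts = vectorizer.fit_transform([plot]).toarray()

print(counts.shape)  # one row, one column per distinct bigram
print(counts)        # no zeros: each bigram was seen in this document
```

Repeated bigrams such as "young wizard" are the only entries that rise above 1, which would explain the occasional larger values.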

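I want to use these bigrams to predict my target variable, which is categorical. When I run the code now, Python complains that the two arrays do not have the same shape: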

`[[1 3 2 ..., 1 1 1]] [ 0.  0.  1.  0.  0.]`

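Can anyone tell me what I am doing wrong? This is the code I use for the bigrams. The first part is the body of a loop over every text (movie plot) in the dataset: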

        plottext = [ row[8] ]
        wordvec = CountVectorizer(ngram_range=(2,2), analyzer='word')
        plotvec = wordvec.fit_transform(plottext).toarray()
        matrix_terms = np.array(wordvec.get_feature_names())
        matrix_freq = np.asarray(plotvec.sum(axis=0)).ravel()
        final_matrix = np.array([matrix_terms,matrix_freq])
        target = { 'Age': row[4] }
        data.append((final_matrix, target))
# Convert categorical target variable to Y
(X, Ycat) = zip(*data)
vec = DictVectorizer(sparse=False)
Y = vec.fit_transform(Ycat)
#Extract textual features from plot
return (X, Y)

The error message I get:

ValueError: could not broadcast input array from shape (2,830) into shape (2)
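This message is consistent with X being a collection of per-plot arrays of different widths (one `(2, n)` array per plot, with `n` depending on the plot), which cannot be stacked into one rectangular matrix. One possible way around it, sketched here under the assumption that a single document-by-bigram matrix is what the classifier should receive, is to fit one `CountVectorizer` on all plots at once. The `rows` data below is invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical rows standing in for the dataset: (plot text, age category).
rows = [
    ("a detective hunts a killer in the city", "16"),
    ("two friends travel across the country", "12"),
    ("the detective returns to the city", "16"),
]
plots = [plot for plot, _ in rows]
targets = [{"Age": age} for _, age in rows]

# Fit once on the whole corpus so all documents share one bigram vocabulary
# and therefore the same number of columns.
vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="word")
X = vectorizer.fit_transform(plots)  # sparse matrix: (n_documents, n_bigrams)

# One-hot encode the categorical target, as the question does.
Y = DictVectorizer(sparse=False).fit_transform(targets)

print(X.shape, Y.shape)  # both have one row per plot
```

Many scikit-learn classifiers expect a one-dimensional array of labels, so depending on the estimator it may be simpler to pass the age categories directly instead of the one-hot `Y`.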

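Please post some of the code that comes after `plotvec = wordvec.fit_transform(plottext).toarray()`.

I have added more code, thanks.

Where is the error thrown? The message looks as if the classifier might be getting an X and a Y of different lengths. Try something like `print len(X), len(Y)`.

The error is thrown when I load the data into the classifier. X and Y have the same length.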
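Equal lengths do not rule out a shape problem: `len` only counts the items in X, not the width of each one. A small sketch with hypothetical arrays shaped like the question's `final_matrix` (the widths 830, 412 and 97 and the shape of `Y` are illustrative assumptions):

```python
import numpy as np

# Hypothetical per-plot matrices, shaped like final_matrix in the question:
# 2 rows (terms, frequencies) and one column per bigram found in that plot.
X = (np.zeros((2, 830)), np.zeros((2, 412)), np.zeros((2, 97)))
Y = np.zeros((3, 5))

# The outer lengths match, which is all that len() can tell us.
print(len(X), len(Y))

# The individual items, however, have different widths.
shapes = [x.shape for x in X]
print(shapes)

# A rectangular feature matrix needs every item to have the same shape.
is_rectangular = len(set(shapes)) == 1
print(is_rectangular)
```

Printing the shape of each element of X, rather than only `len(X)`, would show whether this is what happens with the real data.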