TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] while using RF classifier?

python numpy machine-learning nlp scikit-learn

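I am learning about random forests in scikit-learn and, as an example, I would like to use a random forest classifier for text classification with my own dataset. So first I vectorized the text with tf-idf, and then for classification: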

from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10) 
classifier.fit(X_train, y_train)           
prediction = classifier.predict(X_test)

When I run the classification I get this:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Then I used .toarray() on X_train and got the following:

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

From what I understood in an earlier question, I need to reduce the dimensionality of the numpy array, so I did the same:

from sklearn.decomposition.truncated_svd import TruncatedSVD        
pca = TruncatedSVD(n_components=300)                                
X_reduced_train = pca.fit_transform(X_train)               

from sklearn.ensemble import RandomForestClassifier                 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(X_reduced_train, y_train)                            
prediction = classifier.predict(X_testing)

Then I got this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

I then tried the following:

prediction = classifier.predict(X_train.getnnz())

And got this:

  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
TypeError: object of type 'int' has no len()

This raises two questions: how can I use random forests to classify correctly, and what is happening with X_train?

Then I tried the following:

df = pd.read_csv('/path/file.csv',
header=0, sep=',', names=['id', 'text', 'label'])



X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values



from sklearn.decomposition.truncated_svd import TruncatedSVD
pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report
print '\nscore:', classifier.score(a_test, b_test)
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\n confusion matrix:\n',confusion_matrix(b_test, prediction)
print '\n classification report:\n', classification_report(b_test, prediction)

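It is a bit unclear whether you are passing the same data structure (type and shape) to the classifier's fit method and predict method. Random forests take a long time to run with a large number of features, hence the suggestion to reduce the dimensionality in the post you link to.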

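You should apply the SVD to both the training and test data, so that the classifier is trained on input of the same shape as the data you want to predict for. Check that the input to fit and the input to predict have the same number of features, and that both are arrays rather than sparse matrices.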
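The advice above can be sketched as follows. This is only a minimal illustration, not the asker's data: the toy texts and labels are made up, and the snippet assumes Python 3 with a recent scikit-learn, where TruncatedSVD is imported from sklearn.decomposition. The vectorizer and the SVD are fitted on the training texts and then reused, through transform, on the test texts.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents and labels, invented for illustration only.
train_texts = ["cat on the mat", "angel eyes has", "blue red angel",
               "one two blue", "blue whales eat", "hot tin roof"]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = ["angel eyes", "blue whales"]

vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)

# Fit the vectorizer and the SVD on the training data only.
X_train = svd.fit_transform(vectorizer.fit_transform(train_texts))
# Reuse the fitted objects on the test data: transform, not fit_transform.
X_test = svd.transform(vectorizer.transform(test_texts))

# Both results are dense numpy arrays with the same number of columns.
print(type(X_train).__name__, X_train.shape)
print(type(X_test).__name__, X_test.shape)

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(X_train, train_labels)
prediction = classifier.predict(X_test)
print(prediction)
```

Because TruncatedSVD returns a dense array, no call to .toarray() is needed before fit or predict here.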

Updated with an example (now using a DataFrame):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation before release 0.18
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

df = pd.DataFrame({'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': [0, 0, 0, 1, 1, 1, 0, 3]})

# Vectorize the text column; the labels stay a plain array
X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values

# Reduce the sparse tf-idf matrix to a dense array with 2 columns
pca = TruncatedSVD(n_components=2)
X_reduced_train = pca.fit_transform(X)

# Split the reduced (dense) data, so fit and predict see the same number of features
a_train, a_test, b_train, b_test = train_test_split(X_reduced_train, y, test_size=0.33, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

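Note that the SVD happens before the split into training and test sets, so the array passed to predict has the same number of features as the array that fit was called on.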
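A different way to keep fit and predict consistent, not used in the answer itself, is to chain the steps in a scikit-learn Pipeline and split the raw texts instead. The sketch below assumes Python 3 and a recent scikit-learn (sklearn.model_selection, sklearn.pipeline.make_pipeline). The texts follow the toy example, while the labels are simplified to two classes for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data; the labels are illustrative only.
texts = ["cat on the", "angel eyes has", "blue red angel", "one two blue",
         "blue whales eat", "hot tin roof", "angel eyes has", "have a cat"]
labels = [0, 0, 0, 1, 1, 1, 0, 1]

# Split the raw texts; the pipeline applies every transformation itself.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

# tf-idf -> SVD (dense output) -> random forest, fitted as one estimator.
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(texts_train, y_train)

# predict() sends the test texts through the same fitted vectorizer and SVD.
prediction = model.predict(texts_test)
print(prediction)
```

With this layout the classifier never sees a sparse matrix or a mismatched number of columns, because the transformers fitted on the training texts are reused at prediction time.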

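I don't know much about sklearn, though I vaguely recall some earlier issue triggered by a switch to using sparse matrices. Internally, some of the matrices had to be replaced by m.toarray() or m.todense().

But to give you an idea of what the error message is about, consider: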

In [907]: A=np.array([[0,1],[3,4]])
In [908]: M=sparse.coo_matrix(A)
In [909]: len(A)
Out[909]: 2
In [910]: len(M)
...
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

In [911]: A.shape[0]
Out[911]: 2
In [912]: M.shape[0]
Out[912]: 2
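The session above can also be run as a standalone script. This is a minimal sketch that assumes Python 3 with numpy and scipy installed. Beyond the session, it also prints getnnz() and shows that passing that integer onward, as in classifier.predict(X_train.getnnz()), hands over a plain number that has no length.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 1], [3, 4]])
M = sparse.coo_matrix(A)

# Row counts: shape[0] works for both dense and sparse objects.
print(A.shape[0], M.shape[0])

# getnnz() is the number of stored nonzero entries, not a row count.
print(M.getnnz(), np.count_nonzero(A))

# len() is refused on the sparse matrix.
len_failed = False
try:
    len(M)
except TypeError as exc:
    len_failed = True
    print("len(M) failed:", exc)

# getnnz() returns an integer, and an integer has no len() either.
int_len_failed = False
try:
    len(M.getnnz())
except TypeError as exc:
    int_len_failed = True
    print("len(M.getnnz()) failed:", exc)

# Converting to a dense array restores the usual len() behaviour.
dense = M.toarray()
print(len(dense))
```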

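len() is usually used in Python to count the number of first-level items in a list. Applied to a 2-D array, it gives the number of rows. But A.shape[0] is a better way of counting the rows, and M.shape[0] works the same way. In this case you are not interested in .getnnz(), which is the number of nonzero terms of a sparse matrix. A doesn't have this method, though the count can be derived from A.nonzero().

Added reproducible code in my answer.

You don't need to call the vectorizer on the class labels: X_test_r = tfidf_vect.transform(df['Label']). This should just be an array of labels. You also need to pass the class labels as the second argument to train_test_split.

Thanks for the edit and the example. I tried to use it for my case, but I still have problems. Maybe I got confused: do you know whether I need to split the data twice? I edited the question.

No, just the one split into train and test, but after the SVD transforms the data. Are you getting an error from the TfidfVectorizer in the trace? You can convert a pandas column to an array with X = df['string_column'].values and pass it to the vectorizer.

By the way, don't forget to increase the components parameter of the SVD. I had it set to 2 for the toy dataset.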
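One way to see what raising n_components does is to look at how much variance the reduced data retains. The sketch below reuses the toy texts from the answer and assumes Python 3 with a recent scikit-learn. The candidate values 2, 4 and 6 are arbitrary choices for this tiny dataset; a real corpus would call for much larger values, such as the 300 used in the question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts from the answer's example.
texts = ["cat on the", "angel eyes has", "blue red angel", "one two blue",
         "blue whales eat", "hot tin roof", "angel eyes has", "have a cat"]
X = TfidfVectorizer().fit_transform(texts)

results = {}
for n_components in (2, 4, 6):
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    X_reduced = svd.fit_transform(X)
    # Share of the original variance kept by the retained components.
    retained = svd.explained_variance_ratio_.sum()
    results[n_components] = (X_reduced.shape, retained)
    print(n_components, X_reduced.shape, round(float(retained), 3))
```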