Python vectorizer object toarray(): "array is too big" error
I have created the preprocessed data. Now I want to vectorize it and write it to a text file. When converting the vectorizer output to an array, I get the error below. What are possible solutions?
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(analyzer = "word", \
tokenizer = None, \
preprocessor = None, \
stop_words = None, \
max_features = 1000)
newTestFile = open("testfile.txt", 'r', encoding='latin-1')
featureVector=vectorizer.fit_transform(newTestFile)
train_data_features = featureVector.toarray()
np.savetxt('plotFeatureVector.txt', train_data_features, fmt="%10s %10.3f")
The error:
Traceback (most recent call last):
File "C:/Users/NuMA/Desktop/Lecture Stuff/EE 485/Project/Deneme/bagOfWords.py", line 12, in <module>
train_data_features = featureVector.toarray()
File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\compressed.py", line 964, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\coo.py", line 252, in toarray
B = self._process_toarray_args(order, out)
File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
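vectorizer created a large sparse matrix, featureVector. featureVector.toarray() (I usually use featureVector.A) should create a dense (regular numpy) array from it. Evidently the required size is too big.
Can you print repr(featureVector)? That should show the shape, dtype, and number of nonzero entries of the matrix. My guess is that it has millions of rows and thousands of columns.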
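As a rough illustration of that check, the sketch below prints the repr of a sparse matrix and estimates how many bytes toarray() would have to allocate (rows * cols * itemsize). The helper name dense_nbytes and the small random matrix are stand-ins invented for this example; in the question you would pass featureVector instead.

```python
import numpy as np
from scipy import sparse


def dense_nbytes(matrix):
    """Bytes a dense copy would need: rows * cols * itemsize (hypothetical helper)."""
    rows, cols = matrix.shape
    return rows * cols * matrix.dtype.itemsize


# Small stand-in for featureVector; the real matrix comes from fit_transform().
demo = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)

print(repr(demo))                        # shape, dtype, stored elements, format
print("stored entries:", demo.nnz)
print("dense size in bytes:", dense_nbytes(demo))
```

If the dense size comes out far larger than the memory the interpreter can address, toarray() cannot succeed no matter how sparse the data is.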
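So even if it did work, I doubt that savetxt with fmt="%10s %10.3f" would work. That format string describes only two fields, a string and a float, while the array would have one integer column per feature (max_features = 1000). I also doubt that a csv file of such a large array would be usable.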
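If a text file is really required, one option (an assumption on my part, not something the original code attempts) is to write only the nonzero entries as "row col count" triplets, so the file size follows the number of stored entries rather than rows * cols. The function name save_triplets and the tiny demo matrix are made up for illustration.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_triplets(path, matrix):
    """Write the nonzero entries of an integer sparse matrix as 'row col count' lines."""
    coo = matrix.tocoo()  # COO format exposes parallel row, col, and data arrays
    triplets = np.column_stack((coo.row, coo.col, coo.data))
    np.savetxt(path, triplets, fmt="%d %d %d", header="row col count")


# Tiny stand-in for a matrix of word counts.
counts = sparse.csr_matrix(np.array([[0, 2, 0],
                                     [1, 0, 3]]))
out_path = os.path.join(tempfile.mkdtemp(), "triplets.txt")
save_triplets(out_path, counts)

with open(out_path) as handle:
    print(handle.read())
```

The format string here has exactly three integer fields, matching the three columns that are written.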
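So make sure you understand what vectorizer produces, and reconsider the task of creating a dense array from the result and saving it.
You are not converting a vectorizer object; featureVector is a sparse matrix.
Possible duplicate. In particular, you should use the np.savez/np.load approach from the answer to the dupe target.
The latest scipy.sparse (0.19?) has a pair of savez-like functions.
nnz=12110452, shape=(290988, 1000), dtype=int64, format=csr. I know it is a big matrix, but writing it to an array should not be too hard.
Have you tried making an np.zeros(...) that large directly?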
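To put numbers on that last question: reading the reported shape as 290988 rows by 1000 columns of int64, the dense array needs 290988 * 1000 * 8 bytes, which is more than a 32-bit process can address (the Python35-32 path in the traceback suggests a 32-bit build). The sketch below checks that arithmetic and then shows the savez-like pair in scipy.sparse, save_npz and load_npz, on a small random stand-in matrix. The file name features.npz is arbitrary.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Dense size for the reported matrix, assuming shape (290988, 1000) and int64.
rows, cols = 290988, 1000
dense_bytes = rows * cols * np.dtype(np.int64).itemsize
limit_32bit = 2**31 - 1  # largest size a 32-bit signed index can express
print(dense_bytes, "bytes needed; exceeds 32-bit limit:", dense_bytes > limit_32bit)

# Save and reload a sparse matrix without ever making it dense.
features = sparse.random(500, 1000, density=0.04, format="csr", random_state=0)
npz_path = os.path.join(tempfile.mkdtemp(), "features.npz")
sparse.save_npz(npz_path, features)
restored = sparse.load_npz(npz_path)

print(repr(restored))
```

The file stores only the sparse structure, so its size tracks the number of stored entries rather than the full rows * cols grid.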