
Python vectorizer object toarray(), array too big error


I have created preprocessed data. Now I want to vectorize it and write it to a text file. When converting the vectorizer output to an array, I get this error. What are possible solutions?

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 max_features=1000)
    newTestFile = open("testfile.txt", 'r', encoding='latin-1')
    featureVector = vectorizer.fit_transform(newTestFile)
    train_data_features = featureVector.toarray()
    np.savetxt('plotFeatureVector.txt', train_data_features, fmt="%10s %10.3f")

The error:

    Traceback (most recent call last):
      File "C:/Users/NuMA/Desktop/Lecture Stuff/EE 485/Project/Deneme/bagOfWords.py", line 12, in <module>
        train_data_features = featureVector.toarray()
      File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\compressed.py", line 964, in toarray
        return self.tocoo(copy=False).toarray(order=order, out=out)
      File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\coo.py", line 252, in toarray
        B = self._process_toarray_args(order, out)
      File "C:\Users\NuMA\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scipy\sparse\base.py", line 1039, in _process_toarray_args
        return np.zeros(self.shape, dtype=self.dtype, order=order)
    ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

The vectorizer creates a large sparse matrix, featureVector. featureVector.toarray() (I usually use featureVector.A) should create a dense (regular numpy) array from it. Evidently the requested size is too large.

Can you print repr(featureVector)? That should show the shape, dtype, and number of nonzero entries of that matrix. My guess is it has millions of rows and thousands of columns.
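
A minimal sketch of that check (the shape and nnz in the sample output are the values reported later in the comments; the exact repr layout is scipy's and may differ by version):

    print(repr(featureVector))
    # Prints something like:
    # <290988x1000 sparse matrix of type '<class 'numpy.int64'>'
    #     with 12110452 stored elements in Compressed Sparse Row format>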

And even if the conversion did work, I doubt that savetxt with fmt="%10s %10.3f" would succeed (that format string describes two columns, while the array has 1000). Nor would a csv file of such a large array be usable.

So make sure you understand what the vectorizer produces, and rethink the task of creating a dense array from the result and saving it.
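
One way to rethink it: keep the matrix sparse and save it in scipy's binary sparse format instead of a text dump. A minimal sketch, assuming scipy 0.19 or later for save_npz/load_npz; the .npz file name is illustrative:

    from scipy import sparse

    # Save the sparse matrix directly; no dense conversion required.
    sparse.save_npz('plotFeatureVector.npz', featureVector)

    # Load it back later, still sparse.
    featureVector = sparse.load_npz('plotFeatureVector.npz')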

You are not converting the vectorizer object; featureVector is a sparse matrix.

Possible duplicate; in particular, you should use the np.savez/np.load approach from the duplicate target's answer. The newest scipy.sparse (0.19?) has a similar pair of savez-like functions.

nnz=12110452, shape=(290988, 1000), dtype=int64, format=csr. I know it is a big matrix, but writing it out as an array should not be that hard, should it? Have you tried creating an np.zeros(...) that big directly?
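
For reference, a sketch of the memory arithmetic behind that failure, using the shape and dtype reported above (the traceback shows a 32-bit Python build, whose addressable memory is capped around 2 GB):

    import numpy as np

    shape = (290988, 1000)                       # from repr(featureVector)
    itemsize = np.dtype(np.int64).itemsize       # 8 bytes per element
    print(shape[0] * shape[1] * itemsize)        # 2327904000 bytes, ~2.33 GB

    # toarray() attempts this allocation internally; on 32-bit Python it
    # exceeds the maximum possible block size and raises the same ValueError:
    # np.zeros(shape, dtype=np.int64)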