CountVectorizer().fit in scikit-learn (Python) gives a MemoryError


python scikit-learn


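I am working on an 8-class classification problem. The training set contains about 400,000 labeled entries. I am using CountVectorizer.fit() to vectorize the data, but I get a MemoryError. I tried using HashingVectorizer instead, but without success.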

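import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Load the tab-separated file: one label column and one text column
path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])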
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the Dataset
vect = CountVectorizer()
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)

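You can set max_features, which limits the memory used by the vocabulary. The right value really depends on the task, so you should treat it as a hyperparameter and try to tune it. In NLP (for English), people typically use a vocabulary size of about 10,000. You can also do the same with HashingVectorizer, but you run the risk of hash collisions, which cause several words to increment the same counter.
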
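import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Load the tab-separated file: one label column and one text column
path = 'data/products.tsv'
products = pd.read_table(path, header=None, names=['label', 'entry'])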
X = products.entry
y = products.label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Vectorizing the Dataset
vect = CountVectorizer(max_features=10000)
vect.fit(X_train.values.astype('U'))
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
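
The HashingVectorizer alternative mentioned in this answer could look roughly like the sketch below. The three toy documents and the choice of n_features=2**18 are assumptions made for illustration only; they are not taken from the question, and the right width depends on the data.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy stand-in for the product entries (hypothetical data, not from the question).
docs = ["red cotton shirt", "blue cotton shirt", "wireless optical mouse"]

# n_features fixes the output width up front, so no vocabulary is kept in memory.
# alternate_sign=False and norm=None keep raw non-negative counts,
# which is closer to what CountVectorizer produces.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

# HashingVectorizer is stateless, so transform can be called without fit.
X_hashed = hasher.transform(docs)

print(X_hashed.shape)  # (3, 262144)
```

A smaller n_features saves memory but makes collisions more likely, so several different words may end up sharing one column.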

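Do you preprocess your text data? The point here is that you have a lot of samples, you do not give your CountVectorizer any vocabulary, and you do not use a "stop words" list. The resulting vectors are therefore high-dimensional, and since you have 400k samples, you get a memory error if your laptop does not have enough memory.

Do you get the memory error with HashingVectorizer too?

@MMF I did not preprocess the data. I will keep this in mind next time.

@elyase Yes, I tried using HashingVectorizer, but it gave me the same error.

@IbrahimSharaf Sklearn uses optimized operations (sparse representations, iterators, etc.), but if your data is really large it still has problems. So I suggest you clean your data as much as possible and then rerun your code ;)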
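
As a rough illustration of the cleanup suggested in these comments, the sketch below compares a default CountVectorizer with one that drops English stop words and rare terms. The toy documents and the min_df=2 threshold are assumptions chosen for the example, not values from the thread.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the product entries (hypothetical data, not from the question).
docs = [
    "the red cotton shirt",
    "the blue cotton shirt",
    "a wireless optical mouse",
    "the wireless keyboard and mouse",
]

# Default settings: every token of two or more characters becomes a feature.
baseline = CountVectorizer().fit(docs)

# Drop English stop words and any term that appears in fewer than 2 documents.
pruned = CountVectorizer(stop_words="english", min_df=2).fit(docs)

print(len(baseline.vocabulary_))  # 10
print(sorted(pruned.vocabulary_))  # ['cotton', 'mouse', 'shirt', 'wireless']
```

On a real corpus the effect depends heavily on the text, so min_df is probably best treated as something to tune, much like max_features.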