Python memory error when running an nltk classifier

Tags: python, memory-management, out-of-memory, classification, nltk

I'm running a classifier over a large amount of text, and it's causing a memory error. Python uses roughly 2 GB of memory and then raises the error.

I know that loading this much data and then trying to process it all at once is what causes the error; I just don't know how to fix it, as I'm very new to Python. I think I need to "chunk" the text input or process it line by line, but I'm still not sure how to actually implement that in my code. Any help would be amazing.
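As a rough illustration of the line-by-line idea mentioned above, here is a minimal sketch (the function name `labeled_lines` and its signature are my own invention, not from the original code): iterating over the file object directly streams one line at a time, instead of materializing the whole file with `readlines()`.

```python
def labeled_lines(path, label):
    # Iterating over the open file yields one line at a time,
    # so only the current line is held in memory - unlike
    # readlines(), which loads the entire file into a list.
    with open(path, 'r') as f:
        for line in f:
            yield ([w.lower() for w in line.split()], label)
```

A generator like this could replace the `postxt`/`poslist` and `negtxt`/`neglist` pairs, but note that later stages would also need to avoid holding everything in memory for it to help overall.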

The code:

import nltk, pickle
from nltk.corpus import stopwords


customstopwords = []

p = open('', 'r')
postxt = p.readlines()

n = open('', 'r')
negtxt = n.readlines()

neglist = []
poslist = []

for i in range(0,len(negtxt)):
    neglist.append('negative')

for i in range(0,len(postxt)):
    poslist.append('positive')

postagged = zip(postxt, poslist)
negtagged = zip(negtxt, neglist)

print "STAGE ONE" 

taggedtweets = postagged + negtagged

tweets = []

for (word, sentiment) in taggedtweets:
    word_filter = [i.lower() for i in word.split()]
    tweets.append((word_filter, sentiment))

def getwords(tweets):
    allwords = []
    for (words, sentiment) in tweets:
            allwords.extend(words)
    return allwords

def getwordfeatures(listoftweets):
    wordfreq = nltk.FreqDist(listoftweets)
    words = wordfreq.keys()
    return words

wordlist = [i for i in getwordfeatures(getwords(tweets))
            if i not in stopwords.words('english') and i not in customstopwords]

print "STAGE TWO"

def feature_extractor(doc):
    docwords = set(doc)
    features = {}
    for i in wordlist:
        features['contains(%s)' % i] = (i in docwords)
    return features

print "STAGE THREE"

training_set = nltk.classify.apply_features(feature_extractor, tweets)

print "STAGE FOUR"

classifier = nltk.NaiveBayesClassifier.train(training_set)

print "STAGE FIVE"      

f = open('my_classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()

If you're running 32-bit Python on Windows, that's your problem, and it will typically crash at just under 2 GB. Switching to 64-bit Python gives you far more usable memory.

Downloading it and trying it now seems to have sorted it out, although I suspect this may not be the best approach; memory peaked at 3 GB, if anyone cares.
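Beyond switching to 64-bit Python, the memory footprint here is driven by the size of `wordlist`: `feature_extractor` builds one dict entry per word in it for every tweet. A common way to cap that (a sketch, not the original code; `top_word_features` is a hypothetical helper, and `collections.Counter.most_common` mirrors `nltk.FreqDist.most_common`) is to keep only the most frequent words as features:

```python
from collections import Counter

def top_word_features(all_words, stopword_set, n=2000):
    # Count word frequencies, skipping stopwords, then keep only
    # the n most common words. Every feature dict produced by
    # feature_extractor will then have at most n entries, so the
    # classifier's memory use scales with n rather than with the
    # full vocabulary.
    freq = Counter(w for w in all_words if w not in stopword_set)
    return [w for w, _ in freq.most_common(n)]
```

With something like `wordlist = top_word_features(getwords(tweets), set(stopwords.words('english')), 2000)`, the training set stays much smaller, usually at little cost to accuracy since rare words contribute weak evidence anyway.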