Optimizing memory usage - Pandas/Python

python pandas memory-management

I am currently working with a dataset of raw text that I need to preprocess:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

from autocorrect import spell

for df in [train_df, test_df]:
    df['comment_text'] = df['comment_text'].apply(lambda x: word_tokenize(str(x)))
    df['comment_text'] = df['comment_text'].apply(lambda x: [lemma.lemmatize(spell(word)) for word in x])
    df['comment_text'] = df['comment_text'].apply(lambda x: ' '.join(x))

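However, including the spell function drives memory usage up until I get a "MemoryError". This does not happen when I leave that function out. Is there a way to optimize this process while keeping the spell function? (The dataset contains a lot of misspelled words.)
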
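I don't have access to your DataFrame, so this is somewhat speculative, but here goes. DataFrame.apply runs the lambda function over the whole column at once, so it is probably holding intermediate results in memory. Instead, you could turn the lambda into a predefined function and use DataFrame.map, which applies the function element-wise:
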
def spellcheck_string(tokens):
    # tokens is the list of words produced by word_tokenize for one comment
    return [lemma.lemmatize(spell(word)) for word in tokens]

for df in [train_df, test_df]:
    # ...
    df['comment_text'] = df['comment_text'].map(spellcheck_string)
    # ...

Could you try it and see whether it helps?

In any case, I would use dask: you can split the DataFrame into chunks (partitions), then retrieve each part and work with it.
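Oh nice, I didn't know about this! I'll give it a try! Thanks.
Memory reached 11GB again. Now it's at 13. I tried it once and my computer froze.
How large is your dataset? Roughly how many items are in the column, and what is the typical size of the ['comment_text'] field? I'm happy to help based on some fake data. It may be that operating on the data in batches is the way forward.
I'm using this dataset: . Thanks for your kindness. I'm running the code another time, so I can check the rows without interrupting it. *Can't check the number of rows. My current approach is to parallelize it.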
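
The batching suggestion from the comments can be sketched in plain pandas as follows. This is a minimal sketch, not a tested fix for the MemoryError: map_in_batches and normalize are hypothetical names, normalize stands in for the real spell-and-lemmatize step, and batch_size would need tuning for the actual data.

```python
import pandas as pd


def normalize(text):
    # Hypothetical stand-in for the spell / lemmatize step.
    return str(text).lower()


def map_in_batches(series, func, batch_size=10_000):
    """Apply func to a Series one slice of rows at a time."""
    pieces = []
    for start in range(0, len(series), batch_size):
        batch = series.iloc[start:start + batch_size]
        # Each batch could also be written to disk here instead of kept in memory.
        pieces.append(batch.map(func))
    if not pieces:
        # Empty input: nothing to process.
        return series.copy()
    # The original index is preserved, so the result aligns with the source frame.
    return pd.concat(pieces)


# Illustrative data only.
demo_df = pd.DataFrame({"comment_text": ["A b", "C", "D e F", "g", "H"]})
demo_df["comment_text"] = map_in_batches(
    demo_df["comment_text"], normalize, batch_size=2
)
```

Batching on its own does not reduce what the spell checker allocates per word, but it gives a natural point to save partial results or to monitor memory between batches.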