Python: Efficient term-document matrix with NLTK


I am trying to create a term-document matrix with NLTK and pandas. I wrote the following function:

def fnDTM_Corpus(xCorpus):
    import pandas as pd
    '''to create a Term Document Matrix from a NLTK Corpus'''
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
    DTM.fillna(0,inplace = True)
    return DTM.T
Running it:

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'

newcorpus = PlaintextCorpusReader(corpus_root, '.*')

x = fnDTM_Corpus(newcorpus)
It works fine for a few small files in the corpus, but when I try to run it on a corpus of 4,000 files (each about 2 kB), it gives me a MemoryError.

Am I missing something?

I am using 32-bit Python (on Windows 7, a 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need 64-bit Python for a corpus of this size?
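A quick way to check which interpreter you are actually on, plus a back-of-the-envelope estimate of why the dense DataFrame runs out of memory (my own illustration; the vocabulary size below is an assumption, not a figure from the question):

import struct

# 32 or 64, depending on the interpreter you are running
print(struct.calcsize("P") * 8)

# rough size of the dense term-document matrix
n_docs = 4000          # from the question
vocab_size = 50000     # assumption: a plausible vocabulary for ~4000 small text files
bytes_per_cell = 8     # float64, which is what the DataFrame holds after fillna(0)
print("%.1f GiB" % (n_docs * vocab_size * bytes_per_cell / 1024.0 ** 3))
# roughly 1.5 GiB for the data alone, close to the ~2 GiB address space of a 32-bit process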

Thanks to Radim and larsmans. My goal was to have a DTM like the one you get in R's tm package. I decided to use scikit-learn, partly inspired by a blog post. This is the code I came up with.

I am posting it here in the hope that someone else will find it useful.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

def fn_tdm_df(docs, xColNames = None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets columns Names'''

    #initialize the  vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    #create dataFrame
    df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames

    return df
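A quick standalone check of fn_tdm_df on two in-memory strings (my own toy example, not from the original answer). Note that newer scikit-learn releases renamed get_feature_names() to get_feature_names_out(), so the function above may need that one-line change:

# hypothetical toy usage of fn_tdm_df, for illustration only
docs = ["the cat sat on the mat", "the dog sat"]
dtm = fn_tdm_df(docs, xColNames=["doc1", "doc2"], stop_words=None)
print(dtm)
#      doc1  doc2
# cat     1     0
# dog     0     1
# mat     1     0
# on      1     0
# sat     1     1
# the     2     1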
Using it on the texts in a directory:

DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    '''create a corpus from a directory
    Input: a directory path
    Output: a dictionary with
             the names of the files ['ColNames']
             the text of the corpus ['docs']'''
    import os
    Res = dict(docs = [open(os.path.join(xDIR, f)).read() for f in os.listdir(xDIR)],
               ColNames = ['P_' + x[0:6] for x in os.listdir(xDIR)])  # a real list (not a lazy map), so it also works on Python 3
    return Res
Steps to create the DataFrame:
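The snippet for this step is missing from this copy; a minimal sketch of how the two helpers above would be combined (argument names taken from their signatures, the rest assumed):

# hypothetical combination of the two helpers above
corpus = fn_CorpusFromDIR(DIR)
d1 = fn_tdm_df(docs=corpus['docs'], xColNames=corpus['ColNames'], stop_words=None)
print(d1.head())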
I know the OP wanted to create a TDM in NLTK, but the textmining package (pip install textmining) makes it dead simple:

import textmining
    
# Create some very short sample documents
doc1 = 'John and Bob are brothers.'
doc2 = 'John went to the store. The store was closed.'
doc3 = 'Bob went to the store too.'

# Initialize class to create term-document matrix
tdm = textmining.TermDocumentMatrix()

# Add the documents
tdm.add_doc(doc1)
tdm.add_doc(doc2)
tdm.add_doc(doc3)

# Write matrix file -- cutoff=1 means words in 1+ documents are retained
tdm.write_csv('matrix.csv', cutoff=1)

# Instead of writing the matrix, access its rows directly
for row in tdm.rows(cutoff=1):
    print row
Output:

['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
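If you then want that result as a pandas DataFrame (my own follow-up, not part of the answer above), the CSV written by write_csv can simply be read back; its first row holds the terms, so you get a documents-by-terms frame:

import pandas as pd
# read the matrix written by tdm.write_csv back into pandas
dtm = pd.read_csv('matrix.csv')
print(dtm.shape)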
Or, alternatively, you can use pandas and sklearn:
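The code for this alternative did not survive in this copy; a minimal sketch that reproduces the output below, with the three example documents inferred from that output:

# hypothetical reconstruction of the pandas/sklearn alternative
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())  # get_feature_names_out() on newer scikit-learn
print(df)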

Output:

   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0

An alternative approach using tokens and a DataFrame

import nltk
import pandas as pd  # needed for the DataFrame below
# nltk.download()  # uncomment once to fetch the tokenizer data (e.g. 'punkt') if it is not installed
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

tokens = nltk.word_tokenize(raw)
type(tokens)

tokens[1:10]
['Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

tokens2=pd.DataFrame(tokens)
tokens2.columns=['Words']
tokens2.head()


Words
0   The
1   Project
2   Gutenberg
3   EBook
4   of

tokens2.Words.value_counts().head()
,                 16178
.                  9589
the                7436
and                6284
to                 5278
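The listing above only counts tokens within a single text; to turn the same token/DataFrame idea into a term-document matrix over several documents, one option is pandas.crosstab (my own extension, not part of the original answer):

# hypothetical extension: a term-document matrix from tokenized documents via pd.crosstab
docs = {'doc1': 'John and Bob are brothers.',
        'doc2': 'John went to the store. The store was closed.'}
rows = [(name, tok.lower()) for name, text in docs.items()
        for tok in nltk.word_tokenize(text)]
long_df = pd.DataFrame(rows, columns=['Doc', 'Term'])
dtm = pd.crosstab(long_df['Term'], long_df['Doc'])  # terms as rows, documents as columns
print(dtm)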

Have you tried gensim or similar libraries that have optimized their code for tf-idf? 4,000 files is a tiny corpus. What you need is a sparse representation; pandas has those, and Gensim and scikit-learn know them too.

I think pd.get_dummies(df_column) can do the job. Maybe I am missing something about the document-term matrix.

I get an error when running the code: import stemmer, ImportError: No module named 'stemmer'. How do I fix it? I have already tried pip install stemmer.

What version of Python are you using? There may be a stemmer module import happening inside the textmining package. I just ran pip install textmining and then ran the code above on 2.7.9 and got the expected output.

I ran pip install textmining, then copied and ran the code as-is.

The textmining module may have a hard dependency on Python 2.7. Could you try conda create -n myvirtualenv python=2.7, then source activate myvirtualenv, then repeat the pip install inside that conda environment and rerun the script? Once you are done with the environment, just type source deactivate and you are back on your system-level Python 3.5.

Yes, I think Python 3 users will run into trouble here. I opened an issue about it.
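The first comment above points at gensim; a minimal sketch of that route (my own illustration, not code from the thread), which keeps the corpus as sparse bags-of-words instead of a dense DataFrame:

# hypothetical gensim route: sparse bag-of-words plus tf-idf, no dense matrix needed
from gensim import corpora, models

texts = [doc.lower().split() for doc in
         ['John and Bob are brothers', 'John went to the store', 'Bob went to the store too']]
dictionary = corpora.Dictionary(texts)               # term <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # each document as a sparse list of (term_id, count)
tfidf = models.TfidfModel(bow_corpus)                # tf-idf weights from the raw counts
for doc in tfidf[bow_corpus]:
    print(doc)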