Python: Efficient term-document matrix with NLTK
I am trying to create a term-document matrix with NLTK and pandas. I wrote the following function:
def fnDTM_Corpus(xCorpus):
    '''to create a Term Document Matrix from a NLTK Corpus'''
    import pandas as pd
    fd_list = []
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
    DTM.fillna(0,inplace = True)
    return DTM.T
To run it:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
x = fnDTM_Corpus(newcorpus)
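It works fine for a few small files in the corpus, but it gives me a MemoryError when I try to run it on a corpus of 4,000 files (about 2 KB each).
Am I missing something?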
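I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need to use 64-bit for a corpus of this size?
Thanks to Radim and Larsmans. My goal was to have a DTM like the one you get in R tm. I decided to use scikit-learn, taking part of the inspiration from elsewhere. This is the code I came up with. I am posting it here in the hope that someone else will find it useful.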
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def fn_tdm_df(docs, xColNames = None, **kwargs):
    ''' create a term document matrix as pandas DataFrame
    with **kwargs you can pass arguments of CountVectorizer
    if xColNames is given the dataframe gets columns Names'''
    #initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    #create dataFrame (terms as rows, documents as columns)
    #get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    df = pd.DataFrame(x1.toarray().transpose(), index = vectorizer.get_feature_names_out())
    if xColNames is not None:
        df.columns = xColNames
    return df
To use it on a list of texts in a directory:
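DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    ''' functions to create corpus from a Directories
    Input: Directory
    Output: A dictionary with
        Names of files ['ColNames']
        the text in corpus ['docs']'''
    import os
    files = os.listdir(xDIR)  # list once so docs and names stay aligned
    docs = []
    for f in files:
        # close each file after reading instead of leaving handles open
        with open(os.path.join(xDIR, f)) as fh:
            docs.append(fh.read())
    # a real list (not a lazy map object) so it can be used as column names in Python 3
    Res = dict(docs = docs, ColNames = ['P_' + f[0:6] for f in files])
    return Res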
Creating the DataFrame
I know the OP wanted to create a tdm in NLTK, but the textmining package (pip install textmining) makes it dead simple:
import textmining
# Create some very short sample documents
doc1 = 'John and Bob are brothers.'
doc2 = 'John went to the store. The store was closed.'
doc3 = 'Bob went to the store too.'
# Initialize class to create term-document matrix
tdm = textmining.TermDocumentMatrix()
# Add the documents
tdm.add_doc(doc1)
tdm.add_doc(doc2)
tdm.add_doc(doc3)
# Write matrix file -- cutoff=1 means words in 1+ documents are retained
tdm.write_csv('matrix.csv', cutoff=1)
# Instead of writing the matrix, access its rows directly
for row in tdm.rows(cutoff=1):
    print(row)
Output:
['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
Alternatively, you can use pandas and sklearn:
Output:
   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0
An alternative approach using tokens and a DataFrame:
import nltk
import pandas as pd
# run nltk.download() once to get the tokenizer data
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)
tokens = nltk.word_tokenize(raw)
type(tokens)
tokens[1:10]
['Project',
'Gutenberg',
'EBook',
'of',
'Crime',
'and',
'Punishment',
',',
'by']
tokens2=pd.DataFrame(tokens)
tokens2.columns=['Words']
tokens2.head()
       Words
0        The
1    Project
2  Gutenberg
3      EBook
4         of
tokens2.Words.value_counts().head()
,      16178
.       9589
the     7436
and     6284
to      5278
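The counts above cover a single text. To get from per-text token counts to a term-document matrix, one option is to count each document separately and let pandas align the terms. This is only a sketch: the document names and texts are made up, and a plain whitespace split stands in for `nltk.word_tokenize` so that it runs without downloading tokenizer data.

```python
from collections import Counter

import pandas as pd

# Hypothetical mini-corpus: document name -> raw text
docs = {
    'doc1': 'John and Bob are brothers',
    'doc2': 'Bob went to the store too',
}

# One Counter per document; swap in nltk.word_tokenize(text) for real tokenization
term_counts = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Outer keys become columns (documents), inner keys become the index (terms);
# terms missing from a document come out as NaN, so fill them with 0
tdm = pd.DataFrame(term_counts).fillna(0).astype(int)
print(tdm)
```

Like the function in the question, this builds a dense table, so it suits small corpora; for thousands of documents a sparse representation is the safer choice.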
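Have you tried gensim or similar libraries that have optimized code for tf-idf?
4000 files is a tiny corpus. You need a sparse representation. Pandas has those, as do Gensim and scikit-learn.
I think pd.get_dummies(df_column) can do the job. Maybe I am missing something about the document-term matrix.
I get an error when running the code: import stemmer ImportError: No module named 'stemmer'. How do I fix it? I have already tried pip install stemmer.
What version of Python are you using? The textmining package may contain an import of a stemmer module that only works on that version. I just ran pip install textmining, then ran the code above on 2.7.9 and got the expected output.
I ran pip install textmining. I copied and ran the code as-is.
The textmining module may have a strict dependency on Python 2.7. Can you try conda create -n myvirtualenv python=2.7, then source activate myvirtualenv, and then repeat the pip install inside the conda environment and retry the script? Once you are done with the environment, just type source deactivate and you will be back in your system-level Python 3.5 environment.
Yes, I think Python 3 users will run into problems. I have filed an issue for this.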
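For a sense of scale behind the MemoryError in the question, here is a back-of-the-envelope estimate. The vocabulary size and the number of distinct terms per document are assumed figures, not numbers from the post. A DataFrame built from frequency distributions holds float64 cells once missing counts are filled in, and a 32-bit process on Windows typically has only about 2 GB of address space to work with.

```python
n_docs = 4000               # from the question
vocab_size = 50_000         # assumed vocabulary size
bytes_per_cell = 8          # float64

# Dense matrix: every (document, term) cell is stored
dense_bytes = n_docs * vocab_size * bytes_per_cell
print(f"dense:  {dense_bytes / 2**30:.2f} GiB")

# Sparse matrix: only non-zero cells are stored
avg_distinct_terms = 200    # assumed distinct terms per ~2 KB document
nonzero_cells = n_docs * avg_distinct_terms
# roughly 8 bytes for the value plus 4 bytes for a column index per stored cell
sparse_bytes = nonzero_cells * (8 + 4)
print(f"sparse: {sparse_bytes / 2**20:.2f} MiB")
```

Under these assumptions the dense table alone approaches the address-space limit, and intermediate copies (for example from transposing) make it worse, while the sparse form is a few megabytes.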