Python: TF-IDF with tokenization and lemmatization of a set of txt files using the NLTK library


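I am doing text analysis (tokenization, lemmatization) on Italian texts so that I can later apply TF-IDF and build clusters on top of the result. For preprocessing I use NLTK, and for a single text file everything works: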

import string

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

# Italian stop words (requires the NLTK 'stopwords' data)
it_stop_words = nltk.corpus.stopwords.words('italian')

lmtzr = WordNetLemmatizer()

with open('3003.txt', 'r', encoding="latin-1") as myfile:
    data = myfile.read()

# Tokenize the raw text
word_tokenized_list = nltk.tokenize.word_tokenize(data)

# Drop punctuation tokens and lowercase the rest
word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]

# Drop stop words
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]

# Split tokens on apostrophes, then flatten the nested lists
word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]

# Lemmatize each remaining token
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lmtzr.lemmatize(x) for x in word_tokenized_no_punct_no_sw_no_apostrophe]

The problem is that I need to perform the same operations on a whole folder of .txt files. For this I tried to use PlaintextCorpusReader().
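For reference, a corpus reader over a folder of .txt files can be created roughly as follows. This is only a sketch: the temporary folder, the file names and the file contents are made up so that the example is self-contained, and the reader relies on NLTK's default word tokenizer.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical corpus folder with two tiny Italian files (illustration only)
corpus_root = tempfile.mkdtemp()
samples = [("a.txt", "Il gatto dorme."), ("b.txt", "Il cane corre.")]
for name, text in samples:
    with open(os.path.join(corpus_root, name), "w", encoding="latin-1") as f:
        f.write(text)

# Read every .txt file in the folder with the same encoding as in the question
newcorpus = PlaintextCorpusReader(corpus_root, r".*\.txt", encoding="latin-1")

file_ids = newcorpus.fileids()
# words() returns a lazy corpus view; list() turns it into a plain list of str
first_words = list(newcorpus.words("a.txt"))
print(file_ids, first_words)
```

The resulting newcorpus is a reader object rather than a string, which is why the single-file code cannot be applied to it directly.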

Basically, I cannot apply newcorpus to the functions above, because it is an object and not a string. So my questions are:

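What should the functions for tokenizing and lemmatizing look like for a corpus of files (using PlaintextCorpusReader()), or how should I change the existing single-file functions so that they work on multiple files?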

What should the TF-IDF approach (the standard sklearn approach, vectorizer = TfidfVectorizer()) look like with PlaintextCorpusReader()?

Thanks a lot!

I think your question can be answered by reading this, this, and the TfidfVectorizer docs. For completeness, I summarize the answers below:

First, you can get the file ids as follows:

ids = newcorpus.fileids()

Then, based on these ids, you can retrieve the words, sentences or paragraphs of each document:

doc_words = []
doc_sents = []
doc_paras = []
for id_ in ids:
    # Get words
    doc_words.append(newcorpus.words(id_))
    # Get sentences
    doc_sents.append(newcorpus.sents(id_))
    # Get paragraph
    doc_paras.append(newcorpus.paras(id_))

Now the i-th position of doc_words, doc_sents and doc_paras holds the words, sentences and paragraphs, respectively, of the i-th document in the corpus.

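For tf-idf you probably only need the words. Since the fit method of TfidfVectorizer takes an iterable that yields str, unicode or file objects, you need to either convert each document (an array of tokenized words) back into a single string, or use a dummy tokenizer that works directly on arrays of words.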

You can also pass your own tokenizer to TfidfVectorizer and use PlaintextCorpusReader only for reading the files.
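The dummy-tokenizer idea could look roughly like the sketch below. It assumes the documents are already lists of preprocessed tokens; the sample documents and the helper name identity are made up for illustration, and passing the same function as both tokenizer and preprocessor is one common way to do this, not necessarily the one used in the linked answers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity(tokens):
    """Dummy tokenizer/preprocessor: the input is already a list of tokens."""
    return tokens


# Stand-in for the per-document token lists built from the corpus reader
docs = [
    ["gatto", "dorme"],
    ["cane", "corre"],
    ["gatto", "corre"],
]

vectorizer = TfidfVectorizer(
    tokenizer=identity,     # do not re-tokenize
    preprocessor=identity,  # do not try to lowercase a list
    lowercase=False,
    token_pattern=None,     # unused when a tokenizer is supplied
)
tfidf_matrix = vectorizer.fit_transform(docs)

# One row per document, one column per distinct token
print(tfidf_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

The resulting sparse matrix can then be passed to a clustering algorithm of your choice.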

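For removing punctuation and stop words I usually use list comprehensions. After a quick search I found this, which also covers other approaches such as using filter. For lemmatization I guess you can use a list comprehension as well. I find it a bit odd that you cannot lemmatize a list of words directly... You can wrap all of this into a single line, for example: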

words = [lemmatize(word) for word in words if word not in stopwords and len(word) > 1]

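I can't do that; it fails because of the StreamBackedCorpusView object.

I can't see where you are using StreamBackedCorpusView. You can use doc_words from my answer to do what I described in my last comment; see also the last link I provided.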
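Putting the comments together, the per-document preprocessing might be sketched like this. The function name preprocess, the tiny stop-word set and the sample token lists are all hypothetical; in the real pipeline doc_words would come from newcorpus.words(id_), the stop words from NLTK's Italian list, and lemmatize would be whatever lemmatizer you settle on (here it defaults to a no-op).

```python
import string


def preprocess(tokens, stop_words, lemmatize=lambda word: word):
    """Lowercase, split on apostrophes, drop punctuation, stop words and
    one-character tokens, then lemmatize what is left."""
    cleaned = []
    for token in tokens:
        # Elided forms such as "l'uomo" are split into "l" and "uomo"
        for part in token.lower().split("'"):
            if not part or part in string.punctuation:
                continue
            if part in stop_words or len(part) <= 1:
                continue
            cleaned.append(lemmatize(part))
    return cleaned


# Stand-in for [newcorpus.words(id_) for id_ in ids]
doc_words = [["L'", "uomo", "dorme", "."], ["Il", "cane", "corre", "!"]]
stop_words = {"il", "lo", "la"}  # hypothetical, for illustration only

# list(...) turns a lazy corpus view into a plain list before processing
processed = [preprocess(list(words), stop_words) for words in doc_words]
print(processed)
```

Converting each corpus view with list() first sidesteps the problem of passing a view object where a string or a plain list is expected.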