Python: identifying words that appear in fewer than 1% of corpus documents


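I have a corpus of customer reviews and want to identify rare words, which for my purposes are words that appear in fewer than 1% of the documents in the corpus.

I already have a working solution, but it is too slow for my script: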

# Review data is a nested list of reviews, each represented as a bag of words
doc_clean = [['This', 'is', 'review', '1'], ['This', 'is', 'review', '2'], ..] 

# Save all words of the corpus in a set
all_words = set([w for doc in doc_clean for w in doc])

# Initialize a list for the collection of rare words
rare_words = []

# Loop through all_words to identify rare words
for word in all_words:

    # Count in how many reviews the word appears
    counts = sum([word in set(review) for review in doc_clean])

    # Add word to rare_words if it appears in less than 1% of the reviews
    if counts / len(doc_clean) < 0.01:
        rare_words.append(word)

This may not be the most efficient solution, but it is easy to understand and maintain, and I use it often myself. It relies on Counter and pandas:
import pandas as pd
from collections import Counter

Apply Counter to each document and build a term-frequency matrix (one row per document, one column per word):
df = pd.DataFrame(list(map(Counter, doc_clean)))

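Some cells in the matrix are undefined (NaN). They correspond to words that do not occur in a given document. Count the number of documents in which each word occurs: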
counts = df.notnull().sum()

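Now select the words that occur infrequently (the 0.05 factor is a 5% cutoff; use 0.01 for the 1% threshold in the question):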
rare_words = counts[counts < 0.05 * len(doc_clean)].index.tolist()
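The same document-frequency idea can also be sketched without pandas. This is only an illustration under stated assumptions: the function name rare_words_by_df and the max_share parameter are hypothetical and do not appear in the original posts.

```python
from collections import Counter


def rare_words_by_df(docs, max_share=0.01):
    """Return words that occur in fewer than max_share of the documents."""
    # set(doc) ensures a word is counted at most once per document,
    # so the Counter holds document frequencies, not term frequencies.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    cutoff = max_share * len(docs)
    return sorted(word for word, n in doc_freq.items() if n < cutoff)


# Toy corpus: 'shared' is in all 200 documents, 'only_once' is in one.
docs = [['shared', 'shared']] * 199 + [['shared', 'only_once']]
print(rare_words_by_df(docs))
```

Each document is scanned once, so the work grows with the total number of tokens rather than with vocabulary size times corpus size.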

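Comments:

set([w for doc in doc_clean for w in doc]) => {w for doc in doc_clean for w in doc} saves the list creation and builds the set directly. But the answer below is even better: don't pass a list to sum, just pass a gencomp.

@Jean-Françoisfab Can you explain "passing a gencomp" in more detail?

Yes: don't do sum([x for x in …]), just do sum(x for x in …).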
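As a rough sketch, the two suggestions in these comments could be applied to the question's loop as follows. Converting each review to a set once (review_sets) and collecting the counts in a dict (doc_counts) are additions for this illustration, not something the commenters proposed; the two-review corpus is a shortened stand-in for doc_clean.

```python
doc_clean = [['This', 'is', 'review', '1'], ['This', 'is', 'review', '2']]

# Set comprehension: builds the set directly, with no intermediate list.
all_words = {w for doc in doc_clean for w in doc}

# Assumption for this sketch: convert each review to a set once,
# instead of once per word as in the question's loop.
review_sets = [set(review) for review in doc_clean]

# Generator expression: sum() consumes the booleans one at a time,
# so no temporary list is built.
doc_counts = {
    word: sum(word in review for review in review_sets)
    for word in all_words
}
print(doc_counts['This'], doc_counts['1'])
```

This still tests every word against every review, so it mainly trims constant overhead; it does not change how the running time scales.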
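But this approach only determines the relative frequency of a word across the whole corpus, right? I specifically want to find words that appear in fewer than 1% of the documents. If the word "AAA" does not occur in the first 999 documents but the last document is "AAA…", it should still count as a rare word.

Sorry, I misunderstood your question. I have changed the answer.

Note: TF and DF can differ. The OP's code seems to ask for DF (# Count in how many reviews the word appears) ;P I'd encourage the OP to look at @alvas, since that answer computes DF.

BTW, here's a cool trick: df = list(map(Counter, documents)), then tf = sum(df, Counter()) and all_counts = sum(tf.values()) =)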
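To make the TF versus DF distinction concrete, here is a small sketch on a made-up corpus. The documents list and the names per_doc and doc_freq are illustrative assumptions; the tf and all_counts lines follow the trick quoted in the last comment.

```python
from collections import Counter

documents = [['spam', 'spam', 'spam', 'eggs'], ['eggs'], ['eggs', 'ham']]

# Term frequency (TF): total occurrences of each word across the corpus.
per_doc = list(map(Counter, documents))
tf = sum(per_doc, Counter())
all_counts = sum(tf.values())

# Document frequency (DF): number of documents that contain each word.
doc_freq = Counter(w for doc in documents for w in set(doc))

print(tf['spam'], doc_freq['spam'])
```

Here 'spam' has a term frequency of 3 but a document frequency of 1, which is why a rare-word filter based on the share of documents needs DF.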