Python Counter() and most_common


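I am using Counter() to count the words in an Excel file. My goal is to get the most frequent words in the document. The problem is that Counter() does not seem to handle my file correctly. The code is as follows: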

#1. Building a Counter with bag-of-words

import pandas as pd
df = pd.read_excel('combined_file.xlsx', index_col=None)
import nltk

from nltk.tokenize import word_tokenize

# Tokenize the article: tokens
df['tokens'] = df['body'].apply(nltk.word_tokenize)

# Convert the tokens into string values
df_tokens_list = df.tokens.tolist()

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [[string.lower() for string in sublist] for sublist in df_tokens_list]

# Import Counter

from collections import Counter

# Create a Counter with the lowercase tokens: bow_simple

bow_simple = Counter(x for xs in lower_tokens for x in set(xs))

# Print the 10 most common tokens
print(bow_simple.most_common(10))

#2. Text preprocessing practice

# Import WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in bow_simple if t.isalpha()]

# Remove all stop words: no_stops 
from nltk.corpus import stopwords

no_stops = [t for t in alpha_only if t not in stopwords.words("english")]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)
print(bow)
# Print the 10 most common tokens
print(bow.most_common(10))

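After preprocessing, the most common words are:

[('dry', 3), ('try', 3), ('clean', 3), ('love', 2), ('one', 2), ('serum', 2), ('eye', 2), ('boot', 2), ('woman', 2), ('cream', 2)]

These counts do not match what I get when I count the words by hand in Excel. Do you know what is wrong with my code? I would appreciate any help with this.

The link to the file is here: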

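The problem is that bow_simple is a Counter, and you then process it further. Iterating over a Counter yields each distinct key only once, so every word appears just once in alpha_only. The final result therefore only counts how many variants of a word in the Counter collapse to the same form once they have been lowercased and processed with nltk. The solution is to create a flattened word list and feed that into alpha_only: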

# Create a Counter with the lowercase tokens: bow_simple
wordlist = [item for sublist in lower_tokens for item in sublist] #flatten list of lists
bow_simple = Counter(wordlist)

Then use wordlist, rather than bow_simple, when building alpha_only:

alpha_only = [t for t in wordlist if t.isalpha()]

Output:

[('eye', 3617), ('product', 2567), ('cream', 2278), ('skin', 1791), ('good', 1081), ('use', 1006), ('really', 984), ('using', 928), ('feel', 798), ('work', 785)]
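
To make the difference concrete, here is a minimal sketch on made-up data. The three token lists below are hypothetical stand-ins for the tokenized, lowercased body column, and the names keys_only, lost and kept are introduced only for this illustration. It compares the original Counter built from per-row sets, what is left after iterating over that Counter, and the count from the flattened list.

```python
from collections import Counter

# Hypothetical toy data standing in for the tokenized, lowercased 'body' column.
lower_tokens = [
    ["eye", "cream", "eye", "good"],
    ["eye", "cream", "dry"],
    ["good", "eyes"],
]

# Original approach: set(xs) keeps one copy per row, so this counts
# how many rows contain each word, not how often the word occurs.
bow_simple = Counter(x for xs in lower_tokens for x in set(xs))

# Iterating over a Counter yields each distinct key exactly once,
# so the counts stored in bow_simple are discarded here.
keys_only = [t for t in bow_simple if t.isalpha()]
lost = Counter(keys_only)

# Flattened approach: every occurrence of every word is kept.
wordlist = [item for sublist in lower_tokens for item in sublist]
kept = Counter(t for t in wordlist if t.isalpha())

print(bow_simple["eye"])  # 2 (rows containing "eye")
print(lost["eye"])        # 1 (each key appears once)
print(kept["eye"])        # 3 (actual number of occurrences)
```

In the original code, counts above 1 could only appear when several distinct keys (for example "eye" and "eyes") were lemmatized to the same word, which would explain the small numbers in the question.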

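This code does exactly what you wrote it to do. What makes you unhappy with the result? I have a guess, but please confirm your intent.

This is perfect and solved my problem! Thank you very much!!!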