Python: frequency of the most common words or phrases

python machine-learning nlp


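I'm trying to analyze some data from app reviews.

I want to use nltk's FreqDist to see the most frequently occurring phrases in a file. A phrase can be a single token or a key phrase. I don't want to tokenize the data, because that would only give me the most frequent tokens. But right now the FreqDist function treats each review as one string instead of extracting the words in each review.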

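import string
import nltk
import pandas as pd

df = pd.read_csv('Positive.csv')

def preprocess(text):
    # Lowercase, trim, flatten newlines, then strip ASCII punctuation.
    # (A second .replace(...) in the original lost its arguments and is omitted.)
    translator = str.maketrans("", "", string.punctuation)
    text = text.lower().strip().replace("\n", " ").translate(translator)
    return text

df['Description'] = df['Description'].map(preprocess)
df = df[df['Description'] != '']
word_dist = nltk.FreqDist(df['Description'])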

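("Description" is the body/message of the review.)

For example, I'd like the most frequent terms to be something like: "I like", "useful", "very good app". Instead, the most frequent terms I get are: "I really like this app because Baballa" (the full review).

That is why, when I plot the frequency distribution, it shows whole reviews rather than words or phrases.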
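The behaviour described here follows from how counting works: FreqDist subclasses collections.Counter, and a counter tallies whatever items the iterable yields. A column of review strings yields whole reviews. The sketch below uses Counter directly so it runs without NLTK; the `reviews` list is a made-up stand-in for the preprocessed Description column, and whitespace splitting is a simplification of real tokenization.

```python
from collections import Counter

# Hypothetical stand-in for df['Description'] after preprocessing.
reviews = ["i like this app", "i like it", "i like this app"]

# Counting the reviews themselves: each whole string is one item.
whole_review_counts = Counter(reviews)

# Splitting first, then counting: each word is one item.
word_counts = Counter(word for review in reviews for word in review.split())

print(whole_review_counts.most_common(1))  # [('i like this app', 2)]
print(word_counts.most_common(2))          # [('i', 3), ('like', 3)]
```

So some form of splitting is unavoidable; the open question is only which units (words, phrases, or both) get counted afterwards.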

TL;DR: use ngrams or everygrams:

>>> from itertools import chain
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> from nltk import FreqDist
>>> from nltk.util import everygrams

>>> df = pd.read_csv('x')
>>> df['Description']
0            Here is a sentence.
1    This is a foo bar sentence.
Name: Description, dtype: object

>>> df['Description'].map(word_tokenize)
0              [Here, is, a, sentence, .]
1    [This, is, a, foo, bar, sentence, .]
Name: Description, dtype: object

>>> sents = df['Description'].map(word_tokenize).tolist()

>>> FreqDist(list(chain(*[everygrams(sent, 1, 3) for sent in sents])))
FreqDist({('sentence',): 2, ('is', 'a'): 2, ('sentence', '.'): 2, ('is',): 2, ('.',): 2, ('a',): 2, ('Here', 'is', 'a'): 1, ('a', 'foo'): 1, ('a', 'sentence'): 1, ('bar', 'sentence', '.'): 1, ...})
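To see what the everygrams call contributes, here is a minimal dependency-free sketch of the same idea. The helper `every_ngrams` is a hand-rolled illustration, not NLTK's implementation (the order in which it yields n-grams may differ), the `reviews` list is invented, and `str.split` stands in for word_tokenize.

```python
from collections import Counter

def every_ngrams(tokens, min_len=1, max_len=3):
    """Yield every contiguous n-gram of tokens with min_len <= n <= max_len."""
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])

# Hypothetical preprocessed reviews (lowercased, punctuation removed).
reviews = [
    "i like this app",
    "i like it very useful",
    "very good app i like it",
]
tokenized = [review.split() for review in reviews]

# Range (1, 3): single words plus two- and three-word phrases.
all_counts = Counter(
    gram for tokens in tokenized for gram in every_ngrams(tokens, 1, 3)
)

# Range (2, 3): multi-word phrases only, so single tokens cannot dominate.
phrase_counts = Counter(
    gram for tokens in tokenized for gram in every_ngrams(tokens, 2, 3)
)

print(phrase_counts.most_common(1))  # [(('i', 'like'), 3)]
```

Building the n-grams per review, rather than over one concatenated token list, keeps phrases from spanning the boundary between two reviews.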

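You can create an n-gram range of (1, 3) or similar. That gives you 1-word, 2-word, and 3-word tokens.

At one point you say you don't want to tokenize, and then you say it isn't extracting the words!

Thanks @Vishal, I'll use ngrams!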