Python: most common words in sentences, grouped by category


I'm trying to get the 10 most common words per category. I've seen this answer, but I can't quite modify it to get the output I want:

category | sentence
  A           cat runs over big dog
  A           dog runs over big cat
  B           random sentences include words
  C           including this one
Desired output:

category | word/frequency
   A           runs: 2
               cat: 2
               dog: 2
               over: 2
               big: 2
   B           random: 1
   C           including: 1
Since my dataframe is fairly large, I only want the top 10 most frequently occurring words. I've also seen this approach, but it returns counts of individual letters as well.
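For reference, the example data as a DataFrame (a sketch constructed from the table above):

import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'C'],
    'sentence': ['cat runs over big dog',
                 'dog runs over big cat',
                 'random sentences include words',
                 'including this one'],
})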

One answer splits each sentence into columns, melts them into long form, and counts per group:

import pandas as pd

# Split each sentence into a DataFrame with one column per word
df1 = pd.DataFrame(df.sentence.str.split(' ').tolist())

# Add the category back, since it is not kept by the split
df1['category'] = df['category']

# Melt the word columns into long form (one word per row)
df1 = pd.melt(df1, id_vars='category', value_vars=df1.columns[:-1].tolist())

# Group by category and word, and count (reset_index creates a column named 0)
df1 = df1.groupby(['category', 'value']).size().reset_index()

# Keep the 10 largest counts
df1 = df1.nlargest(10, 0)
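Note that nlargest(10, 0) keeps the 10 largest counts overall, not per category. If the goal is the top words within each category, a sketch applied to the grouped counts from the step above, before nlargest (the name counts and the cutoff of 3 are assumptions):

# counts: the result of the groupby(...).size().reset_index() step above
top_per_cat = (
    counts.sort_values(0, ascending=False)
          .groupby('category')
          .head(3)
)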
Another answer joins the rows within each group and applies nltk's FreqDist after tokenizing the sentences:

import nltk

df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
Output:

category
A    FreqDist({'cat': 2, 'runs': 2, 'over': 2, 'big': 2, 'dog': 2})
B    FreqDist({'random': 1, 'sentences': 1, 'include': 1, 'words': 1})
C    FreqDist({'including': 1, 'this': 1, 'one': 1})
Name: sentence, dtype: object
If you want to filter by the frequency of the most common words, a line like the following works (here, the 2 most common words per category):
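A sketch of such a line, assuming FreqDist.most_common (inherited from collections.Counter) is what's meant:

# 2 most common words per category; nltk imported above
df.groupby('category')['sentence'].apply(
    lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))).most_common(2)
)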


This is close - for my final output I'm trying to achieve something more like df1.groupby('category')[0].value_counts(). Thanks! I'll play around with it some more to get the format I want, but this is a great start. The shape I'm after:
category           
a         big          2.0
          cat          2.0
          dog          2.0
          over         2.0
          runs         2.0
c         include      1.0
          random       1.0
          sentences    1.0
          words        1.0
d         including    1.0
          one          1.0
          this         1.0
Name: sentence, dtype: float64
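One way to get that shape directly (a sketch, assuming pandas >= 0.25 for Series.explode; out is a name introduced here):

# One word per row, then count words within each category
out = (
    df.set_index('category')['sentence']
      .str.split()
      .explode()
      .groupby(level=0)
      .value_counts()
)

Another answer uses collections.Counter with most_common: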
from collections import Counter

df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))

category
A            [(cat, 2), (runs, 2)]
B    [(random, 1), (sentences, 1)]
C      [(including, 1), (this, 1)]
Name: sentence, dtype: object
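For the question's top 10 per category, the same line with most_common(10) applies.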
%timeit df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
2.07 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
4.96 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)