Python: count the 10 most frequent words in each row


My sample dataset looks like this:

"Author", "Normal_Tokenized"  
x       , ["I","go","to","war","I",..]  
y       , ["me","you","and","us",..]
z       , ["let","us","do","our","best",..]
I want a DataFrame that reports the 10 most common words and their counts (frequencies) for each author:

"x_text", "x_count", "y_text", "y_count", "z_text", "z_count"  
go ,        1000   ,  come   ,  120     , let     , 12
and so on.

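I tried the following snippet, but it only uses the last author's values instead of every author's.

This code actually returns the 10 most common words an author used in their novels: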

df_words = pd.concat([pd.DataFrame(
    data={'Author': [row['Author'] for _ in row['Normal_Tokenized']], 'Normal_Tokenized': row['Normal_Tokenized']})
    for idx, row in df.iterrows()], ignore_index=True)
df_words = df_words[~df_words['Normal_Tokenized'].isin(stop_words)]

def authorCommonWords(numWords):
    for author in authors:
        authorWords = df_words[df_words['Author'] == author].groupby('Normal_Tokenized').size().reset_index().rename(
            columns={0: 'Count'})
        authorWords.sort_values('Count', inplace=True)
        df = pd.DataFrame(authorWords[-numWords:])
    df.to_csv("common_word.csv", header=False,mode='a', encoding='utf-8',
                  index=False)
    return authorWords[-numWords:]

authorCommonWords(10)
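Each author has about 130,000 samples. This example finds the 10 most repeated words across those 130,000 samples. I want these 10 words in a separate column for each author.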

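This seems to be what you are looking for.

It would be best if we could see your input and output DataFrames. I understand that you need apply for the NLP step here, but you can probably get rid of some of the loops for a better approach.

@billhuang Thanks, done. Could you please check?

Thank you very much, this works perfectly. However, if n_top exceeds the length of the data, it fails with "ValueError: Length of values does not match length of index". For example, changing n_top to 6 here triggers the error. Do you have any idea about this?

Just pad the missing words with dummy values (text="" and freq=0). I have updated the answer.

Data and code: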
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I","go","to","war","I"],
                         ["me","you","and","us"],
                         ["let","us","do","our","best"]]
})
n_top = 6  # number of top words to report per author

df_want = pd.DataFrame(index=range(n_top))
for au, ls in df.itertuples(index=False, name=None):
    words, freqs = np.unique(ls, return_counts=True)
    # np.unique returns the words in alphabetical order, so reorder them by
    # descending count (a stable sort keeps alphabetical order among ties)
    order = np.argsort(-freqs, kind="stable")
    words, freqs = words[order], freqs[order]
    len_words = len(words)
    if len_words >= n_top:
        df_want[f"{au}_text"] = words[:n_top]
        df_want[f"{au}_count"] = freqs[:n_top]
    else:  # too few distinct words: pad with "" and 0
        df_want[f"{au}_text"] = [words[i] if i < len_words else "" for i in range(n_top)]
        df_want[f"{au}_count"] = [freqs[i] if i < len_words else 0 for i in range(n_top)]
print(df_want)
  x_text  x_count y_text  y_count z_text  z_count
0      I        2    and        1   best        1
1     go        1     me        1     do        1
2     to        1     us        1    let        1
3    war        1    you        1    our        1
4               0               0     us        1
5               0               0               0
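
For comparison, the same wide table could probably also be built with collections.Counter from the standard library, whose most_common method already orders words by count. The following is only a minimal sketch, not part of the original answer: the helper name top_words_wide and its stop_words parameter are made up for illustration, and it assumes one row per author (a repeated author would overwrite the earlier columns).

```python
from collections import Counter

import pandas as pd


def top_words_wide(df, n_top, stop_words=frozenset()):
    """Hypothetical helper: one <author>_text / <author>_count pair per author."""
    columns = {}
    for author, tokens in zip(df["Author"], df["Normal_Tokenized"]):
        # Count tokens, skipping any stop words that were passed in
        counts = Counter(t for t in tokens if t not in stop_words)
        top = counts.most_common(n_top)        # highest counts first
        top += [("", 0)] * (n_top - len(top))  # pad authors with too few words
        columns[f"{author}_text"] = [word for word, _ in top]
        columns[f"{author}_count"] = [count for _, count in top]
    return pd.DataFrame(columns)


df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I", "go", "to", "war", "I"],
                         ["me", "you", "and", "us"],
                         ["let", "us", "do", "our", "best"]],
})
result = top_words_wide(df, n_top=6)
print(result)
```

Words with equal counts come out in the order they first appear in the token list rather than alphabetically, so the tie order can differ from the np.unique version above.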