Python: compute the 10 most frequent words for each row
Tags: python, pandas
My sample dataset looks like this:
"Author", "Normal_Tokenized"
x , ["I","go","to","war","I",..]
y , ["me","you","and","us",..]
z , ["let","us","do","our","best",..]
I want a DataFrame that reports the 10 most common words and their counts (frequencies) for each author:
"x_text", "x_count", "y_text", "y_count", "z_text", "z_count"
go , 1000 , come , 120 , let , 12
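and so on.
I tried the snippet below, but it only uses the last author's values rather than every author's.
The code does return the 10 most common words an author uses in their novels: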
df_words = pd.concat([pd.DataFrame(
    data={'Author': [row['Author'] for _ in row['Normal_Tokenized']], 'Normal_Tokenized': row['Normal_Tokenized']})
    for idx, row in df.iterrows()], ignore_index=True)
df_words = df_words[~df_words['Normal_Tokenized'].isin(stop_words)]

def authorCommonWords(numWords):
    for author in authors:
        authorWords = df_words[df_words['Author'] == author].groupby('Normal_Tokenized').size().reset_index().rename(
            columns={0: 'Count'})
        authorWords.sort_values('Count', inplace=True)
        df = pd.DataFrame(authorWords[-numWords:])
        df.to_csv("common_word.csv", header=False, mode='a', encoding='utf-8',
                  index=False)
    return authorWords[-numWords:]

authorCommonWords(10)
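Each author has roughly 130,000 samples. The code above finds the 10 most repeated words among those 130,000 samples. I want those 10 words in separate columns for each author.

Comments:

Seems to be what you are looking for.

It would help to see your input and output DataFrames. I understand you need `apply` here for the NLP step, but you can probably drop some of the loops in favor of a better approach.

@billhuang Thanks, done. Could you please take a look?

Thanks a lot, this works perfectly. However, if n_top exceeds the length of the data, it fails with "ValueError: Length of values does not match length of index". For example, changing n_top to 6 here triggers the error. Do you have any ideas?

Just use dummy values (text="" and freq=0) to make up for the missing words. I have updated the answer.

Answer (data and code):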
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I","go","to","war","I"],
                         ["me","you","and","us"],
                         ["let","us","do","our","best"]]
})

n_top = 6  # number of top words to report per author
df_want = pd.DataFrame(index=range(n_top))
for au, ls in df.itertuples(index=False, name=None):
    words, freqs = np.unique(ls, return_counts=True)
    # np.unique returns the words in sorted order, not by frequency,
    # so reorder by descending count (stable sort keeps ties in sorted order)
    order = np.argsort(-freqs, kind="stable")
    words, freqs = words[order], freqs[order]
    len_words = len(words)
    if len_words >= n_top:
        df_want[f"{au}_text"] = words[:n_top]
        df_want[f"{au}_count"] = freqs[:n_top]
    else:  # too few distinct words: pad with "" and 0
        df_want[f"{au}_text"] = [words[i] if i < len_words else "" for i in range(n_top)]
        df_want[f"{au}_count"] = [freqs[i] if i < len_words else 0 for i in range(n_top)]
print(df_want)
  x_text  x_count y_text  y_count z_text  z_count
0      I        2    and        1   best        1
1     go        1     me        1     do        1
2     to        1     us        1    let        1
3    war        1    you        1    our        1
4               0               0     us        1
5               0               0               0
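
As an alternative to the `np.unique` loop, the same wide table can be sketched with `collections.Counter`, whose `most_common` method returns words already ordered by count. This is only a minimal sketch under some assumptions: the helper name `top_words_wide` and its `stop_words` parameter are hypothetical (not part of the original answer), and rows belonging to the same author are assumed to be merged before counting. Ties come back in first-seen order rather than alphabetical order, so the row order can differ from the output above.

```python
from collections import Counter

import pandas as pd


def top_words_wide(df, n_top=10, stop_words=frozenset()):
    """Build one `<author>_text` / `<author>_count` column pair per author.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Merge token lists per author, in case an author spans several rows.
    counters = {}
    for author, tokens in zip(df["Author"], df["Normal_Tokenized"]):
        counter = counters.setdefault(author, Counter())
        counter.update(t for t in tokens if t not in stop_words)

    columns = {}
    for author, counter in counters.items():
        # most_common() orders by descending count; ties keep first-seen order.
        top = counter.most_common(n_top)
        # Pad with ("", 0) when the author has fewer than n_top distinct words.
        top += [("", 0)] * (n_top - len(top))
        columns[f"{author}_text"] = [word for word, _ in top]
        columns[f"{author}_count"] = [count for _, count in top]
    return pd.DataFrame(columns)


sample = pd.DataFrame({
    "Author": ["x", "y", "z"],
    "Normal_Tokenized": [["I", "go", "to", "war", "I"],
                         ["me", "you", "and", "us"],
                         ["let", "us", "do", "our", "best"]],
})
result = top_words_wide(sample, n_top=6)
print(result)
```

Passing the question's `stop_words` collection as the third argument would filter those tokens before counting, which mirrors the `isin(stop_words)` step in the original attempt.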