Text clustering in Python using SciPy hierarchical clustering

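I have a text corpus of more than 1,000 articles, one article per line. I am trying to use hierarchical clustering with SciPy in Python to produce clusters of related articles. This is the code I use for the clustering: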

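# Agglomerative clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac

# X is a sparse document-term matrix; linkage() needs a dense array
tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")

# Plot the dendrogram
plt.clf()
hac.dendrogram(tree)
plt.show()

# Cut the tree into at most 2 flat clusters
clustering = hac.fcluster(tree, 2, criterion="maxclust")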

This produces a dendrogram plot.

Then I cut the tree at the third level with fcluster()

and got this output: [2…, 2]
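For readers unsure what this output means: fcluster returns one integer label per input row, in the same order as the rows passed to linkage. With t=2 and the "maxclust" criterion the tree is cut so that at most two flat clusters remain, which is not necessarily the same as cutting at a particular level of the dendrogram. A minimal sketch on made-up toy data (the `points` array is a hypothetical stand-in for `X.toarray()`):

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Toy stand-in for X.toarray(): 6 "documents" with 2 features each (made-up values).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # three points near the origin
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # three points far away
])

tree = hac.linkage(points, method="complete", metric="euclidean")

# criterion="maxclust" with t=2 asks for at most 2 flat clusters.
labels = hac.fcluster(tree, 2, criterion="maxclust")

# One label per input row; count how many rows fall in each cluster.
cluster_ids, sizes = np.unique(labels, return_counts=True)
print(labels)
print(dict(zip(cluster_ids.tolist(), sizes.tolist())))
```

Counting the labels this way is a quick check of whether almost every article landed in a single cluster, which an output consisting mostly of 2s would suggest.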

My question is: how can I find the 10 most frequent words in each cluster, so that I can suggest a topic for each cluster?

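You can do the following:

1. Align your result (your clustering variable) with your input (the 1,000+ articles).
2. Using the pandas library, you can use a groupby function with the cluster number as its key.
3. For each group (using the get_group function), keep a dictionary of counts for every word you encounter.
4. You can now sort the word counts in the dictionary in descending order and take as many of the most frequent words as you want.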
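The steps above can be sketched roughly as follows. This is only an illustration under assumptions: `articles` and `clustering` are hypothetical stand-ins for the real corpus and the fcluster output, the column names `text` and `cluster` are made up, and the tokenizer is a naive lowercase/whitespace split.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical stand-ins: `articles` is the input text, `clustering` the fcluster output.
articles = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
clustering = [1, 1, 2, 2]

# Step 1: align each cluster label with its article.
df = pd.DataFrame({"text": articles, "cluster": clustering})

# Step 2: group the articles by cluster label.
groups = df.groupby("cluster")

top_words = {}
for cluster_id in groups.groups:
    # Step 3: count every word seen in this cluster's articles.
    counts = defaultdict(int)
    for text in groups.get_group(cluster_id)["text"]:
        for word in text.lower().split():
            counts[word] += 1
    # Step 4: sort by count, descending, and keep the 10 most frequent words.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    top_words[cluster_id] = ranked[:10]

print(top_words)
```

On real text, very common words such as "the" will likely dominate these lists unless stop words are removed before counting.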

Good luck, and please accept my answer if it is what you were looking for.

This is how I would do it. Given a df with article names and article texts, like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Argument  6 non-null      object
 1   Article   6 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
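The snippets in this answer use X and df_dtm without showing how they are built from df. One possible way to get both, as a sketch under assumptions: the article contents below are made up, the tokenizer is a naive lowercase/whitespace split, and in practice a vectorizer such as scikit-learn's CountVectorizer would typically do this job.

```python
from collections import Counter

import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical df with the same two columns as above (contents are made up).
df = pd.DataFrame({
    "Argument": ["a1", "a2", "a3"],
    "Article": ["cats chase mice", "mice fear cats", "stocks fell today"],
})

# Count the words of each article with a naive lowercase/whitespace tokenizer.
word_counts = [dict(Counter(text.lower().split())) for text in df["Article"]]

# Document-term matrix: one row per article, one column per word, 0 where absent.
df_dtm = pd.DataFrame(word_counts, index=df.index).fillna(0).astype(int)

# Sparse version of the same matrix, so that X.toarray() works in the clustering code.
X = csr_matrix(df_dtm.to_numpy())

print(df_dtm)
```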

then get the clustering you chose:

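# Agglomerative clustering
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hac

# X is a sparse document-term matrix; linkage() needs a dense array
tree = hac.linkage(X.toarray(), method="complete", metric="euclidean")

# Plot the dendrogram
plt.clf()
hac.dendrogram(tree)
plt.show()

# Cut the tree into at most 2 flat clusters
clustering = hac.fcluster(tree, 2, criterion="maxclust")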

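and add the cluster labels to df_dtm (the document-term matrix as a DataFrame, with one row per article and one column per word), then sum the word counts per cluster: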

df_dtm['_cluster_'] = clustering
df_dtm.index.name = '_article_'
df_word_count = df_dtm.groupby('_cluster_').sum().reset_index().melt(
    id_vars=['_cluster_'], var_name='_word_', value_name='_count_'
)

Finally, take the most frequent words in each cluster:

words_1 = df_word_count[df_word_count._cluster_==1].sort_values(
    by=['_count_'], ascending=False).head(3)
words_2 = df_word_count[df_word_count._cluster_==2].sort_values(
    by=['_count_'], ascending=False).head(3)
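The two statements above repeat the same selection once per cluster. With more clusters, or to get the 10 words the question asks for, the same idea can be written once for all clusters. A sketch on a made-up `df_word_count` with the same three columns (the words and counts are hypothetical):

```python
import pandas as pd

# Hypothetical long-format counts with the same columns as df_word_count.
df_word_count = pd.DataFrame({
    "_cluster_": [1, 1, 1, 1, 2, 2, 2, 2],
    "_word_":    ["cat", "dog", "fish", "bird", "cat", "dog", "fish", "bird"],
    "_count_":   [9, 4, 1, 6, 0, 7, 8, 2],
})

n = 3  # use n = 10 for the top 10 words per cluster

# Sort by count within each cluster, then keep the first n rows of every cluster.
top_n = (
    df_word_count
    .sort_values(["_cluster_", "_count_"], ascending=[True, False])
    .groupby("_cluster_")
    .head(n)
)

# Collect the selected words into one list per cluster.
topics = top_n.groupby("_cluster_")["_word_"].apply(list).to_dict()
print(topics)
```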

Why do you think 3 is an appropriate value?