Python confusion matrix
I have a problem evaluating my clustering results. I have 3 lists:
# 10 objects in my corpus
TOT = [1,2,3,4,5,6,7,8,9,10]
# .... clustering into k=5 clusters
# For each automatic cluster:
# objects with ID 2 and 8 were assigned to this cluster
predicted = [2,8]
# For each cluster in the ground truth:
true = [2,4,9]
# compute TP, FP, TN, FN
T = set(TOT)  # TOT is a list, so it must be converted before set arithmetic
A = set(predicted)
B = set(true)
TP = list(A & B)
FP = list(A - B)
TN = list((T - A) & (T - B))
FN = list(B - A)
My question is: can I compute TP, FP, TN, FN for each cluster? Does that make sense?
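As a rough illustration of what "per cluster" counts could look like, the sketch below compares one predicted cluster against one ground-truth group using Python's built-in set type. The function name `confusion_counts` and the derived precision and recall are assumptions added for illustration; the data is the same as in the lists above.

```python
def confusion_counts(predicted, true, universe):
    """Return (TP, FP, TN, FN) counts for one predicted cluster vs. one ground-truth group."""
    a, b, u = set(predicted), set(true), set(universe)
    tp = len(a & b)        # in the cluster and in the ground-truth group
    fp = len(a - b)        # in the cluster but not in the group
    fn = len(b - a)        # in the group but missed by the cluster
    tn = len(u - (a | b))  # in neither
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([2, 8], [2, 4, 9], range(1, 11))
print(tp, fp, tn, fn)  # 1 1 6 2

# Guard against empty denominators when deriving ratios from the counts.
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
```

Counts like these describe a single (cluster, group) pair, so whether they are meaningful depends on how clusters are matched to ground-truth groups.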
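Edit: reproducible code
Short story:
I'm doing NLP. I have a corpus of 9k documents that I processed with Gensim's Word2Vec: I extracted the vectors and computed a "document vector" for each document. I then computed the cosine similarity between the document vectors, which gave me a 9k x 9k matrix.
Finally, using this matrix, I ran KMeans and hierarchical clustering.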
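The similarity step described above can be sketched in plain Python. This is only an illustration on toy data: the helper names `cosine_similarity` and `similarity_matrix` are hypothetical, and the three 2-dimensional vectors stand in for the real document vectors, whose construction is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Square matrix of pairwise cosine similarities, as nested lists."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy stand-ins for document vectors (assumption, not real data).
doc_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
sim = similarity_matrix(doc_vectors)
```

For a corpus of 9k documents a vectorized numerical library would be far more practical than nested Python loops; the sketch is only meant to show what each matrix entry represents.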
Let's consider the output of HAC, with 14 clusters:
id label
0 1
1 8
....
9k 12
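Now the question is: how do I evaluate the quality of the clusters?
My professor has read 100 of these 9k documents and created some "clusters", saying: "OK, this document talks about label1" or "OK, this other document talks about label2 and label3".
Note that the labels provided by my professor are completely unrelated to the clustering process; they are just summaries of the topics, but their number is the same (14 in this example).
Code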
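I have two data frames: the one above, from the HAC clustering, and another one from my professor's 100 documents, which looks like this:
(following the previous example)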
GT
id   label1  label2  label3  ....  label14
5    1       0       0       ....  0
34   0       1       1       ....  0
...........................
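Since a document can carry more than one label in this table (document 34 has both label2 and label3), it may help to turn the one-hot rows into one set of ids per label before comparing them against clusters. The sketch below does this with plain dictionaries rather than a data frame; it uses only the two example rows and the three label columns shown above, and `label_sets` is a hypothetical helper name.

```python
from collections import defaultdict

# The two example rows from the GT table; the elided columns are left out.
gt_rows = [
    {"id": 5, "label1": 1, "label2": 0, "label3": 0},
    {"id": 34, "label1": 0, "label2": 1, "label3": 1},
]

def label_sets(rows, id_key="id"):
    """Map each label column to the set of document ids flagged with 1."""
    groups = defaultdict(set)
    for row in rows:
        for column, flag in row.items():
            if column != id_key and flag == 1:
                groups[column].add(row[id_key])
    return dict(groups)

groups = label_sets(gt_rows)
print(groups)  # {'label1': {5}, 'label2': {34}, 'label3': {34}}
```

Note that the resulting groups can overlap, so they do not form a partition of the documents the way the HAC clusters do.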
Finally, my code does the following:
# since I have labels for only 100 of my 9k documents
indexes = list(map(int, ground_truth['id'].values.tolist()))
reduced_df = clusters.loc[clusters['id'].isin(indexes), :]
# now reduced_df contains only the documents that have been read by my prof
TOT = set(reduced_df['id'].values.tolist())
for each cluster from HAC:
    doc_in_this_cluster = [ .... ]
    for each cluster from GT:
        doc_in_this_label = [ ... ]
        A = set(doc_in_this_cluster)
        B = set(doc_in_this_label)
        TP = list(A & B)
        FP = list(A - (A & B))
        TN = list((TOT - A) & (TOT - B))
        FN = list(B - A)
And the actual code:
indexes = list(map(int, self.ground_truth['id'].values.tolist()))
# reduce clusters_file to only the manually analyzed documents: --------> TOT
reduced_df = self.clusters.loc[self.clusters['id'].isin(indexes), :]
TOT = set(reduced_df['id'].values.tolist())
clusters_groups = reduced_df.groupby('label')
for label, df_group in clusters_groups:
    docs_in_cluster = df_group['id'].values.tolist()
    row = []
    for col in self.ground_truth.columns[1:]:
        constraints = list(
            map(int, self.ground_truth.loc[self.ground_truth[col] == 1, 'id'].values.tolist())
        )
        A = set(docs_in_cluster)
        B = set(constraints)
        TP = list(A & B)
        FP = list(A - (A & B))
        TN = list((TOT - A) & (TOT - B))
        FN = list(B - A)
        print(f"HAC Cluster: {label} - GT Label: {col}")
        print(TP, FP, TN, FN)
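The same nested loop can be tried without pandas on a toy example. Everything in the sketch below is made up for illustration: the cluster assignments, the label groups, and the function name `contingency_counts`. Only the set arithmetic follows the code above.

```python
def contingency_counts(cluster_of, label_groups):
    """Count (TP, FP, TN, FN) for every (cluster, label) pair.

    cluster_of:   dict mapping document id -> predicted cluster label
    label_groups: dict mapping ground-truth label -> set of document ids
    """
    universe = set(cluster_of)  # plays the role of TOT
    clusters = {}
    for doc_id, cluster in cluster_of.items():
        clusters.setdefault(cluster, set()).add(doc_id)

    table = {}
    for cluster, a in sorted(clusters.items()):
        for label, b in sorted(label_groups.items()):
            b = b & universe  # ignore ids that were never clustered
            table[(cluster, label)] = (
                len(a & b),               # TP
                len(a - b),               # FP
                len(universe - (a | b)),  # TN
                len(b - a),               # FN
            )
    return table

# Hypothetical toy data: four labelled documents in two clusters.
cluster_of = {5: 1, 34: 8, 40: 8, 77: 1}
label_groups = {"label1": {5, 77}, "label2": {34}, "label3": {34, 40}}

table = contingency_counts(cluster_of, label_groups)
for (cluster, label), counts in table.items():
    print(f"HAC Cluster: {cluster} - GT Label: {label}", counts)
```

Each tuple is a one-vs-rest count for a single (cluster, label) pair. Because a document may carry several ground-truth labels, the same document can count as a true positive for more than one label.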
I assume you are trying to implement set operations. You can try the following functions to solve your problem:
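def set_subtract(A, B):
    # elements of A that are not in B, in A's order
    C = []
    for i in A:
        if i in B:
            pass
        else:
            C.append(i)
    return C

def set_intersection(A, B):
    # elements of A that are also in B, in A's order
    C = []
    for i in A:
        if i in B:
            C.append(i)
    return C

TOT = [1,2,3,4,5,6,7,8,9,10]
A = [1,2,3,4]
B = [2,3]
print("A&B", set_intersection(A, B))
print("TOT-B", set_subtract(TOT, B))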
Output:
A&B [2, 3]
TOT-B [1, 4, 5, 6, 7, 8, 9, 10]
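You have to implement these functions yourself.
Your confusion matrix confuses me. You may need to provide more context, preferably something that can be run and that explains the error. In a comment you mention a "corpus": are you doing natural language processing with ML?
I will edit the first message with all this information.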