Python: how do I save the cluster assignments when calculating bias, and prevent them from being overwritten in the next iteration?


I am implementing an algorithm that calculates the bias of each cluster and then splits the cluster with the highest bias into new clusters. In the end, I want to find the clusters with the highest bias, meaning the instances on which the classifier makes either more or fewer errors.

Here is the algorithm:

  1. Start with the entire dataset as one cluster
  2. Split it into two clusters with KMeans
  3. Calculate the macro F1 score of each cluster
  4. Calculate the bias of both clusters. The bias is: F1-score_cluster_k - F1-score of all clusters excluding cluster k
  5. If max(bias_cluster_i, bias_cluster_j) >= bias_previous_cluster: add cluster_i and cluster_j to the cluster list and remove the previous cluster
  6. Continue with the cluster from cluster_list that has the highest bias in the error metric
  7. Split that cluster into 2 clusters with KMeans and continue with step 3

  For this algorithm to work, I need to save the cluster assignments and the F-scores from previous iterations so that I can compare them in the current iteration (step 5); a rough sketch of the bookkeeping I have in mind follows my code below.

One solution I came up with is to save the cluster assignments as a new column and then compare that column with the new cluster assignments, but is there a better way to prevent these cluster assignments from being overwritten? (See the sketch right below for what I mean.)
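
A minimal sketch of that column-snapshot idea; the assignments and F-score values here are made-up placeholders, not output from my actual data:

    import pandas as pd

    # Toy data frame standing in for my test set (values are placeholders)
    df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0]})

    # Iteration 1: store the assignments in their own column and keep the
    # per-cluster F-scores in a dict keyed by that column name
    df["cluster_iter_1"] = [0, 0, 1, 1]
    f_scores = {"cluster_iter_1": {0: 0.8, 1: 0.6}}

    # Iteration 2: write the new assignments into a new column instead of
    # overwriting the old one, so both iterations can still be compared
    df["cluster_iter_2"] = [0, 1, 1, 1]
    f_scores["cluster_iter_2"] = {0: 0.7, 1: 0.9}

    print(df)
    print(f_scores)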
    Here is my code:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import load_wine
    
    data = load_wine()
    df_data = pd.DataFrame(data.data, columns=data.feature_names)
    df_target = pd.DataFrame(data = data.target)
    
    # Merging the datasets into one dataframe
    all_data = df_data.merge(df_target, left_index=True, right_index=True)
    all_data.rename( columns={0 :'target_class'}, inplace=True )
    all_data.head()
    
    # Dividing X and y into train and test data (small train data to gain more errors)
    X_train, X_test, y_train, y_test = train_test_split(df_data, df_target, test_size=0.60, random_state=2)
    
    # Training a RandomForest Classifier 
    model = RandomForestClassifier()
    model.fit(X_train, y_train.values.ravel())
    
    # Obtaining predictions
    y_hat = model.predict(X_test)
    
    # Converting y_hat from a NumPy array to a DataFrame, keeping the test-set index
    predictions_col = pd.DataFrame(index=y_test.index)
    predictions_col['predicted_class'] = y_hat
    predictions_col['true_class'] = y_test.values.ravel()
    
    # Calculating the errors with the absolute value 
    predictions_col['errors'] = abs(predictions_col['predicted_class'] - predictions_col['true_class'])
    
    # It doesn't matter whether the misclassification is between class 0 and 2 or between 0 and 1, it has the same error value. 
    predictions_col['errors'] = predictions_col['errors'].replace(2.0, 1.0)
    
    # Adding predictions to test data
    df_out = pd.merge(X_test, predictions_col, left_index = True, right_index = True)
    
    # Scaling the feature columns of the test instances
    scaled_matrix = StandardScaler().fit_transform(df_out[data.feature_names])
    
    # Calculating the F-score of one class from the prediction results
    def F_score(results, class_number):
        true_pos = results.loc[(results["true_class"] == class_number) & (results["predicted_class"] == class_number)]
        false_pos = results.loc[(results["true_class"] != class_number) & (results["predicted_class"] == class_number)]
        false_neg = results.loc[(results["true_class"] == class_number) & (results["predicted_class"] != class_number)]

        try:
            precision = len(true_pos) / (len(true_pos) + len(false_pos))
            recall = len(true_pos) / (len(true_pos) + len(false_neg))
            f_score = 2 * ((precision * recall) / (precision + recall))
        except ZeroDivisionError:
            return 0

        return f_score
    
    # Calculating the macro average F-score
    def mean_f_score(results):
        classes = results['true_class'].unique()
        class_scores = [F_score(results, class_number) for class_number in classes]

        return sum(class_scores) / len(classes)
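
    # Note: essentially the same macro F1 can be obtained directly with scikit-learn:
    # from sklearn.metrics import f1_score
    # f1_score(results['true_class'], results['predicted_class'], average='macro')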
    
    def calculate_bias(clustered_data, cluster_number):
        cluster_x = clustered_data.loc[clustered_data["assigned_cluster"] == cluster_number]
        remaining_clusters = clustered_data.loc[clustered_data["assigned_cluster"] != cluster_number]
        
        # Bias definition:
        return mean_f_score(remaining_clusters) - mean_f_score(cluster_x)
    
    MAX_ITER = 10
    cluster_comparison = []
    
    # start with all instances in one cluster
    # scaled_matrix
    for i in range(1, MAX_ITER):
        kmeans_algo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled_matrix)
        # Adding the assigned cluster as a column
        clustered_data = pd.DataFrame(kmeans_algo.predict(scaled_matrix), columns=['assigned_cluster'])

        # Calculating the bias per cluster
        negative_bias_0 = calculate_bias(clustered_data, 0)
        negative_bias_1 = calculate_bias(clustered_data, 1)
        # the code below doesn't work
        if max(negative_bias_0, negative_bias_1) >= bias_prev_iteration:
    
    
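Conceptually, the kind of bookkeeping I think I need looks roughly like the sketch below, run on toy data. Here bias_of is only a placeholder (an error-rate difference instead of my F1-based bias) and the stopping logic is simplified; the point is that each cluster is kept as its own record with its saved row indices and bias, so nothing from a previous iteration gets overwritten:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)

    # Toy stand-ins for my real inputs: a feature matrix and a 0/1 error flag per instance
    X = rng.rand(60, 4)
    errors = rng.randint(0, 2, size=60)

    def bias_of(indices):
        # Placeholder bias: error rate outside the cluster minus error rate inside it
        outside_mask = np.ones(len(errors), dtype=bool)
        outside_mask[indices] = False
        inside = errors[indices].mean()
        outside = errors[outside_mask].mean() if outside_mask.any() else inside
        return outside - inside

    # Every cluster is a record that is never overwritten:
    # the row indices it contains plus the bias computed for it
    all_rows = np.arange(len(X))
    clusters = [{"indices": all_rows, "bias": bias_of(all_rows)}]

    for _ in range(5):
        # Pick the saved cluster with the highest bias and try to split it
        parent_pos = max(range(len(clusters)), key=lambda i: clusters[i]["bias"])
        parent = clusters[parent_pos]
        if len(parent["indices"]) < 2:
            break

        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[parent["indices"]])
        children = [
            {"indices": parent["indices"][labels == k],
             "bias": bias_of(parent["indices"][labels == k])}
            for k in (0, 1)
        ]

        # Step 5: only replace the parent if one of the children has a higher bias
        if max(child["bias"] for child in children) >= parent["bias"]:
            clusters.pop(parent_pos)
            clusters.extend(children)
        else:
            break

    for cluster in clusters:
        print(len(cluster["indices"]), round(cluster["bias"], 3))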

Can you reduce the code to the essential parts? I tried to run it and it returns errors everywhere. For example, some parts of your code use predicted_class while other parts use assigned_cluster as input... putting these together does not work.