Matplotlib Seaborn scatterplot matrix - adding extra points with custom styles

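I am running k-means clustering on activity in some open-source projects on GitHub and am trying to plot the results with a Seaborn scatterplot matrix.

I can successfully plot the results of the clustering analysis (example TSV output below):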

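    user_id  issue_comments       issues_created      pull_request_review_comments  pull_requests      category
    1        0.149365197908872    2.0100502512562812  0.0                           0.60790273556231   Group 0
    1882     0.11202389843166542  0.5025125628140703  0.0                           0.0                Group 1
             2.315160567587752    20.603015075376884  0.13297872340425532           1.21580547112462   Group 2
    1789     36.8185212845407     82.91457286432161   75.66489361702128             74.46808510638297  Group 3

The problem I'm having is that I would also like to plot the cluster centroids on the matrix plot. Currently my plotting script looks like this: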

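    import seaborn as sns
    import pandas as pd
    from pylab import savefig

    sns.set()
    # Use the first column (user_id) as the index so that it is not plotted.
    # pd.DataFrame.from_csv did this by default but was removed in pandas 1.0,
    # so read_csv with index_col=0 is used instead.
    data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)
    grid = sns.pairplot(data, hue="category", diag_kind="kde")
    savefig('normalized_clusters.png', dpi=150)

This produces the expected output: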

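I would like to mark the cluster centroids on each plot. I can think of two ways to do this:

  • Create a new "CENTROID" category and plot it together with the other points
  • Manually add extra points to the plots after calling sns.pairplot(data, hue="category", diag_kind="kde")

If the first option is the solution, I would like to be able to customize the marker (perhaps a star?) to make it more prominent.

If the second option is the solution, I'm all ears. I'm quite new to Seaborn and Matplotlib, so any help is welcome :-)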

pairplot isn't all that well suited to this sort of thing, but it can be made to work with a few tricks. Here's what I would do.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    sns.set_color_codes()
    
    # Make some random iid data
    cov = np.eye(3)
    ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
                    np.random.multivariate_normal([1, 1, 1], cov, 50)])
    ds = pd.DataFrame(ds, columns=["x", "y", "z"])
    
    # Fit the k means model and label the observations
    km = KMeans(2).fit(ds)
    ds["label"] = km.labels_.astype(str)
    
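Now the not-so-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations, using the label column to identify the centroids appropriately: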

    centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
    centroids["label"] = ["0 centroid", "1 centroid"]
    full_ds = pd.concat([ds, centroids], ignore_index=True)
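
If only the labelled table is at hand rather than the fitted KMeans object (as with the question's TSV, which already has a category column), the centroid rows can be approximated by averaging each cluster's members. This is a sketch under that assumption: the toy values and column names below are made up, and per-cluster means only approximate km.cluster_centers_ when k-means has converged on the same data.

```python
import pandas as pd

# Hypothetical labelled observations; "label" plays the role of the cluster column
ds = pd.DataFrame({
    "x": [0.0, 2.0, 10.0, 12.0],
    "y": [1.0, 3.0, 20.0, 22.0],
    "label": ["0", "0", "1", "1"],
})

# One row per cluster holding the mean of each feature
centroids = ds.groupby("label").mean().reset_index()

# Rename the labels so the centroids form their own hue levels
centroids["label"] = centroids["label"] + " centroid"

# Append the centroid rows to the observations
full_ds = pd.concat([ds, centroids], ignore_index=True)
print(full_ds)
```

The resulting full_ds has the same shape as the one built from km.cluster_centers_, so it can be passed to the same plotting code.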
    
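Then you just need to use PairGrid, which is a bit more flexible than pairplot and lets you map other plot attributes by the hue variable in addition to color (at the expense of not being able to draw histograms on the diagonals).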
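
As a minimal sketch of such a PairGrid call, assuming the two-cluster ds/centroids setup from the snippets above (regenerated here with synthetic data so the block runs on its own), something like the following should work. The marker sizes, marker shapes, and repeated palette are illustrative choices, and the Agg backend is selected only so that no display is needed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

# Synthetic stand-in data: two Gaussian blobs in three dimensions
rng = np.random.default_rng(0)
ds = pd.DataFrame(np.vstack([rng.normal(0, 1, (50, 3)),
                             rng.normal(1, 1, (50, 3))]),
                  columns=["x", "y", "z"])

# Fit k-means, label the observations, and append the centroid rows
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ds)
ds["label"] = km.labels_.astype(str)
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)

# hue_kws supplies one value per hue level, in hue_order:
# small circles for observations, large stars for centroids
g = sns.PairGrid(full_ds, hue="label",
                 hue_order=["0", "1", "0 centroid", "1 centroid"],
                 palette=["b", "r", "b", "r"],
                 hue_kws={"s": [20, 20, 500, 500],
                          "marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
```

Repeating the palette keeps each centroid the same color as its cluster, so only size and marker distinguish it.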

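An alternative solution is to plot the observations as normal, then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.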

    # Plot the data
    g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])
    
    # Change the PairGrid dataset and add a new layer
    centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
    g.data = centroids
    g.hue_vals = [0, 1]
    g.map_offdiag(plt.scatter, s=500, marker="*")
    

I know I'm a bit late to the party, but here is a generalized version of mwaskom's code that handles n clusters. It might save someone some time.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    def cluster_scatter_matrix(data_norm, cluster_number):
        """Plot a scatter matrix of k-means clusters with centroids as stars."""
        sns.set_color_codes()
        # Remember the feature columns before a "label" column is added;
        # otherwise the centroid frame gets one column name too many.
        features = list(data_norm.columns)
        data = data_norm.copy()  # leave the caller's DataFrame untouched
        km = KMeans(cluster_number).fit(data[features])
        data["label"] = km.labels_.astype(str)
        clusters = [str(n) for n in range(cluster_number)]
        centroid_labels = [c + " centroid" for c in clusters]
        centroids = pd.DataFrame(km.cluster_centers_, columns=features)
        centroids["label"] = centroid_labels
        full_ds = pd.concat([data, centroids], ignore_index=True)
        g = sns.PairGrid(full_ds, hue="label",
                         hue_order=clusters + centroid_labels,
                         # Repeat the colors so each centroid matches its cluster
                         palette=sns.color_palette(n_colors=cluster_number) * 2,
                         hue_kws={"s": [20] * cluster_number + [500] * cluster_number,
                                  "marker": ["o"] * cluster_number + ["*"] * cluster_number})
        g.map(plt.scatter, linewidth=1, edgecolor="w")
        g.add_legend()
        return g
    
As of seaborn 0.11.0, the alternative solution (updating the data and then calling g.map_offdiag) appears to be broken. I raised an issue about this.