Python: single-linkage clustering of an edit distance matrix with a distance-threshold stopping criterion

Tags: python, python-3.x, scipy, cluster-analysis, bioinformatics


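I'm trying to assign flat, single-linkage clusters to sequence IDs separated by an edit distance below a given threshold, starting from a square distance matrix. I believe scipy.cluster.hierarchy.fclusterdata() with criterion='distance' may be a way to do this, but it isn't returning the clusters I expect for this toy example.

Specifically, in the 4x4 distance matrix example below, I expect clusters_50 (which uses t=50) to contain 2 clusters, but 3 are actually found. I think the problem is that fclusterdata() doesn't expect a distance matrix, but fcluster() doesn't appear to do what I want either.

I've also looked at sklearn.cluster.AgglomerativeClustering, but that requires specifying n_clusters, and I want to create as many clusters as needed until the distance threshold I specify is met.

I see there is currently an unmerged scikit-learn pull request for exactly this feature.

Can anyone point me in the right direction? Clustering with an absolute distance threshold as the stopping criterion seems like a common use case.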

import pandas as pd
from scipy.cluster.hierarchy import fclusterdata

cols = ['a', 'b', 'c', 'd']

df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
                   {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
                   {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
                   {'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
                  index=cols)

clusters_20 = fclusterdata(df.values, t=20, criterion='distance')
clusters_50 = fclusterdata(df.values, t=50, criterion='distance')
clusters_100 = fclusterdata(df.values, t=100, criterion='distance')

names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}

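I figured it out: pass the output of linkage() to fcluster(). Unlike fclusterdata(), linkage() can work from a precomputed distance matrix in condensed form (the code passes metric='precomputed'). The one-line call, the full solution with its output, and a function version appear in the code blocks further down, in that order.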

You did not set the metric parameter. The default is metric='euclidean', not precomputed.

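Thanks, but I don't think that is actually the problem. fclusterdata() does not accept metric='precomputed' because, as I now understand it, it works directly on observations rather than on a distance matrix, unlike fcluster(). Passing metric='precomputed' to fclusterdata() gives ValueError: Unknown Distance Metric: precomputed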
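The point about observations versus distances can be checked directly. The sketch below assumes that fclusterdata() with its default metric clusters on pairwise Euclidean distances between the rows it receives, so each row of the toy matrix is treated as a 4-dimensional point. The variable names are illustrative and not part of the original post.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The toy edit distance matrix from the question (rows/columns: a, b, c, d)
dm = np.array([[0, 29467, 35, 13],
               [29467, 0, 29468, 29470],
               [35, 29468, 0, 38],
               [13, 29470, 38, 0]], dtype=float)

# Euclidean distances between ROWS, i.e. what you get when the distance
# matrix is mistaken for a table of observations
row_dist = squareform(pdist(dm, metric='euclidean'))

a, b, c, d = range(4)
print(f"a-d: edit distance {dm[a, d]:.0f}, row distance {row_dist[a, d]:.2f}")
print(f"a-c: edit distance {dm[a, c]:.0f}, row distance {row_dist[a, c]:.2f}")
```

The a-c row distance comes out above 50 even though the edit distance is 35, which is consistent with c staying in its own cluster at t=50 in the question's output.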

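Well, the problem is that fclusterdata uses Euclidean distance and cannot use a precomputed distance matrix (so another function is needed), isn't it?

Please take your sarcasm elsewhere; I was trying to express gratitude. Scroll down for my accepted answer, which I posted an hour before yours.

That answer doesn't mention that the metric parameter is key. fclusterdata could easily be modified to accept a precomputed distance matrix in the future.

Agreed that the API could be simpler and/or better documented with usage examples. I've emphasized the metric arg in the accepted answer.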

Output of the question's fclusterdata() code:

>>> names_clusters_20  # Expecting 3 clusters, finds 3
{'a': 1, 'b': 3, 'c': 2, 'd': 1}

>>> names_clusters_50  # Expecting 2 clusters, finds 3
{'a': 1, 'b': 3, 'c': 2, 'd': 1}

>>> names_clusters_100  # Expecting 2 clusters, finds 2
{'a': 1, 'b': 2, 'c': 1, 'd': 1}

The one-line call from the accepted answer:

fcluster(linkage(condensed_dm, metric='precomputed'), criterion='distance', t=20)

Full solution:

import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

cols = ['a', 'b', 'c', 'd']

df = pd.DataFrame([{'a': 0, 'b': 29467, 'c': 35, 'd': 13},
                   {'a': 29467, 'b': 0, 'c': 29468, 'd': 29470},
                   {'a': 35, 'b': 29468, 'c': 0, 'd': 38},
                   {'a': 13, 'b': 29470, 'c': 38, 'd': 0}],
                  index=cols)

# Convert the square matrix to the condensed 1-D form that linkage() expects
dm_cnd = squareform(df.values)

# Build the single-linkage tree once, then cut it at each distance threshold
Z = linkage(dm_cnd, metric='precomputed')
clusters_20 = fcluster(Z, criterion='distance', t=20)
clusters_50 = fcluster(Z, criterion='distance', t=50)
clusters_100 = fcluster(Z, criterion='distance', t=100)

names_clusters_20 = {n: c for n, c in zip(cols, clusters_20)}
names_clusters_50 = {n: c for n, c in zip(cols, clusters_50)}
names_clusters_100 = {n: c for n, c in zip(cols, clusters_100)}

>>> names_clusters_20
{'a': 1, 'b': 3, 'c': 2, 'd': 1}

>>> names_clusters_50
{'a': 1, 'b': 2, 'c': 1, 'd': 1}

>>> names_clusters_100
{'a': 1, 'b': 2, 'c': 1, 'd': 1}
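To see why these thresholds give three, two, and two clusters, it can help to look at the merge heights stored in the linkage matrix. This is a minimal sketch on the same toy matrix, using NumPy instead of pandas. It leaves out the metric argument, since linkage() treats 1-D input as a condensed distance matrix. The names merge_heights and n_clusters are illustrative.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# The toy edit distance matrix (rows/columns: a, b, c, d)
dm = np.array([[0, 29467, 35, 13],
               [29467, 0, 29468, 29470],
               [35, 29468, 0, 38],
               [13, 29470, 38, 0]], dtype=float)

# Single-linkage tree built from the condensed distances
Z = linkage(squareform(dm), method='single')

# Column 2 of the linkage matrix holds the distance at which each merge happens
merge_heights = Z[:, 2]
print(merge_heights)

# Number of flat clusters obtained when cutting the tree at each threshold
n_clusters = {t: len(set(fcluster(Z, t=t, criterion='distance')))
              for t in (20, 50, 100)}
print(n_clusters)
```

The merges happen at 13 (a with d), 35 (c joins), and 29467 (b joins), so any threshold strictly between 13 and 35 leaves three clusters and any threshold strictly between 35 and 29467 leaves two.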

As a function:

import pandas as pd
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_df(df, method='single', threshold=100):
    '''
    Accepts a square distance matrix as an indexed DataFrame and returns a
    dict mapping each column label to its flat cluster ID.
    Performs single-linkage clustering by default; see the
    scipy.cluster.hierarchy.linkage docs for other methods.
    '''
    # Convert the square matrix to the condensed form that linkage() expects
    dm_cnd = squareform(df.values)
    clusters = fcluster(linkage(dm_cnd,
                                method=method,
                                metric='precomputed'),
                        criterion='distance',
                        t=threshold)
    names_clusters = {s: c for s, c in zip(df.columns, clusters)}
    return names_clusters
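For downstream use it is often handier to have the members of each cluster than a label per ID. The sketch below shows one way to do that. group_ids is a hypothetical helper, not part of the original answer or of SciPy. It takes a plain list of names and a square matrix, so it does not need pandas.

```python
from collections import defaultdict

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage

def group_ids(names, dm, threshold, method='single'):
    """Hypothetical helper: map each flat cluster ID to the names it contains."""
    # squareform() validates symmetry and the zero diagonal before condensing
    condensed = squareform(np.asarray(dm, dtype=float))
    labels = fcluster(linkage(condensed, method=method),
                      t=threshold, criterion='distance')
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[int(label)].append(name)
    return dict(groups)

names = ['a', 'b', 'c', 'd']
dm = [[0, 29467, 35, 13],
      [29467, 0, 29468, 29470],
      [35, 29468, 0, 38],
      [13, 29470, 38, 0]]

groups_50 = group_ids(names, dm, threshold=50)
print(sorted(groups_50.values()))
```

The cluster IDs themselves are arbitrary labels, so the example sorts the member lists rather than relying on a particular numbering.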