Newick tree representation to scipy.cluster.hierarchy linkage matrix format

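I have a set of genes that have been aligned and clustered based on their DNA sequences, and I have this set of genes in a Newick tree representation. Does anyone know how to convert this format to the scipy.cluster.hierarchy.linkage matrix format? From the scipy docs for the linkage matrix:

An (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

At least from the scipy documentation, the description of how the linkage matrix is structured is rather confusing. What do they mean by "iteration"? Also, how does this representation keep track of which original observations are in which cluster?

I would like to figure out how to do this conversion because the rest of the cluster analyses in my project have been done with the scipy representation, and I have been using it for plotting.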

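I worked out how the linkage matrix is generated from the tree representation, thanks to @cel for the clarification. Let's take the example from the Newick wiki page.

The tree, in string format, is: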

(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);

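First, compute the distances between all pairs of leaves. For example, to compute the distance between A and B, traverse the tree from A to B through the nearest branch point. Since the Newick format gives the distance between each leaf and its branch point, the distance from A to B is simply 0.1 + 0.2 = 0.3. For A to D, we have 0.1 + (0.5 + 0.4) = 1.0, since the distance from D to its nearest branch point is 0.4, and the distance from D's branch point to A's is 0.5. The distance matrix therefore looks like this (with indexing A=0, B=1, C=2, D=3):

distance_matrix=
[[0.0, 0.3, 0.9, 1.0],
 [0.3, 0.0, 1.0, 1.1],
 [0.9, 1.0, 0.0, 0.7],
 [1.0, 1.1, 0.7, 0.0]]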
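The path-sum rule described above can be sketched in plain Python. This is only an illustration, not how ete or scipy do it: the tree is written out by hand as a child-to-parent map, and the names `PARENT`, `CD`, `root`, `path_to_root` and `leaf_distance` are made up for this example.

```python
# Hand-written encoding of '(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
# Each node maps to (parent, branch length to parent).
# "CD" and "root" are hypothetical names for the two internal nodes.
PARENT = {
    "A": ("root", 0.1),
    "B": ("root", 0.2),
    "C": ("CD", 0.3),
    "D": ("CD", 0.4),
    "CD": ("root", 0.5),
}


def path_to_root(node):
    """Return {ancestor: distance from node to that ancestor}, node itself included."""
    dist = {node: 0.0}
    total = 0.0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def leaf_distance(a, b):
    """Sum the branch lengths on the path from a to b."""
    up_a = path_to_root(a)
    up_b = path_to_root(b)
    # With positive branch lengths, the lowest common ancestor is the shared
    # ancestor that minimizes the summed distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


leaves = ["A", "B", "C", "D"]
distance_matrix = [[leaf_distance(a, b) for b in leaves] for a in leaves]
for row in distance_matrix:
    print([round(x, 2) for x in row])
```

The printed rows should agree with the leaf-to-leaf distances worked out by hand, for instance 0.3 for A to B and 1.0 for A to D.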

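From here, the linkage matrix is easy to find. Since we already have n=4 clusters (A, B, C, D) as the original observations, we need to find the additional n-1 clusters of the tree. Each step simply merges two clusters into a new one, and we pick the two clusters that are closest to each other. In this case, A and B are closest, so the first row of the linkage matrix looks like this: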

[A,B,0.3,2]

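From now on, we treat A and B as one cluster, whose distance to the nearest branch point is the distance between A and B.

Now we have 3 clusters left: AB, C, and D. We can update the distance matrix to see which clusters are closest. Let AB have index 0 in the updated distance matrix: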

distance_matrix=
[[0.0, 1.1, 1.2],
 [1.1, 0.0, 0.7],
 [1.2, 0.7, 0.0]]

We can now see that C and D are closest to each other, so let's combine them into a new cluster. The second row of the linkage matrix is now

[C,D,0.7,2]

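Now we have just two clusters left, AB and CD. The distances from these clusters to the root branch are 0.3 and 0.7, respectively, so their distance is 1.0. The final row of the linkage matrix is: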

[AB,CD,1.0,4]

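The scipy matrix would not actually contain strings as shown here; it uses the indexing scheme instead. Since we combined A and B first, AB gets index 4 and CD gets index 5. So the actual result we should see in the scipy linkage matrix is: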

[[0,1,0.3,2],
 [2,3,0.7,2],
 [4,5,1.0,4]]
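To make the index scheme concrete, and to show how the matrix keeps track of which original observations end up in which cluster, here is a small sketch that replays the rows in order. The helper name `cluster_members` is hypothetical; it only relies on the rule that row i creates cluster n+i.

```python
def cluster_members(Z, n):
    """Map every cluster index to the sorted list of original observations it contains."""
    # Indices below n are the original observations themselves.
    members = {i: [i] for i in range(n)}
    for i, (left, right, dist, count) in enumerate(Z):
        merged = members[int(left)] + members[int(right)]
        # The fourth column is the number of observations in the new cluster.
        assert len(merged) == int(count)
        # Row i of the linkage matrix creates cluster n + i.
        members[n + i] = sorted(merged)
    return members


Z = [[0, 1, 0.3, 2],
     [2, 3, 0.7, 2],
     [4, 5, 1.0, 4]]
labels = ["A", "B", "C", "D"]

for idx, obs in cluster_members(Z, n=4).items():
    print(idx, [labels[j] for j in obs])
```

Each "iteration" in the scipy docs is one row: cluster 4 is {A, B}, cluster 5 is {C, D}, and cluster 6 is the root holding all four leaves.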

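This is the general approach for going from the tree representation to the scipy linkage matrix representation. However, other Python packages already provide tools for reading trees in Newick format, and from those we can fairly easily get the distance matrix and then pass it to scipy's linkage function. Below is a small script that does exactly that for this example.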

from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance
# The original answer imported from ete2 (Python 2); ete3 is its Python 3 successor.
from ete3 import ClusterTree, TreeStyle


tree = ClusterTree('(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);')
leaves = tree.get_leaf_names()
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_length = True
ts.show_branch_support = True

idx_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
# Labels ordered by their index: ['A', 'B', 'C', 'D']
idx_labels = sorted(idx_dict, key=idx_dict.get)

# just going through the construction in my head, this is what we should get in the end
my_link = np.array([[0, 1, 0.3, 2],
                    [2, 3, 0.7, 2],
                    [4, 5, 1.0, 4]])

# Fill the symmetric leaf-to-leaf distance matrix from the tree
dmat = np.zeros((4, 4))

for l1, l2 in combinations(leaves, 2):
    d = tree.get_distance(l1, l2)
    dmat[idx_dict[l1], idx_dict[l2]] = dmat[idx_dict[l2], idx_dict[l1]] = d

print('Distance:')
print(dmat)

# linkage() expects a condensed distance matrix, hence squareform();
# the metric argument is ignored when distances are already precomputed.
schlink = sch.linkage(scipy.spatial.distance.squareform(dmat), method='average', metric='euclidean')

print('Linkage from scipy:')
print(schlink)

print('My link:')
print(my_link)

# Compare with a tolerance, since the distances are floating-point sums
print('Did it right?:', np.allclose(schlink, my_link))

dendro = sch.dendrogram(my_link, labels=idx_labels)
plt.show()

tree.show(tree_style=ts)
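For readers who want to see what one "iteration" of the merging does without installing anything, here is a naive pure-Python sketch of average-linkage clustering, which is the method the script asks scipy for. Note that this is a swapped-in textbook update rule, not the hand update used in the walkthrough: its intermediate distances differ from the hand-updated 3x3 matrix, although for this tree the resulting rows should come out the same. The function name `average_linkage` is made up for the example.

```python
def average_linkage(dmat):
    """Naive average-linkage clustering; returns rows [i, j, distance, size]."""
    n = len(dmat)
    size = {i: 1 for i in range(n)}  # active clusters and their sizes
    # Distances between active clusters, keyed by (smaller index, larger index)
    dist = {(i, j): dmat[i][j] for i in range(n) for j in range(i + 1, n)}
    rows = []
    for step in range(n - 1):
        # Pick the closest pair of active clusters
        (a, b), d = min(dist.items(), key=lambda kv: kv[1])
        new = n + step  # row `step` creates cluster n + step
        for c in size:
            if c in (a, b):
                continue
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # Size-weighted mean of the distances to the two merged clusters
            dist[(c, new)] = (size[a] * d_ac + size[b] * d_bc) / (size[a] + size[b])
        # Drop every distance that involves one of the merged clusters
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        size[new] = size.pop(a) + size.pop(b)
        rows.append([a, b, d, size[new]])
    return rows


dmat = [[0.0, 0.3, 0.9, 1.0],
        [0.3, 0.0, 1.0, 1.1],
        [0.9, 1.0, 0.0, 0.7],
        [1.0, 1.1, 0.7, 0.0]]

for row in average_linkage(dmat):
    print([row[0], row[1], round(row[2], 6), row[3]])
```

This is quadratic per step and only meant for reading; scipy's linkage is the tool to use on real data.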

I found this solution:

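import numpy as np
import pandas as pd
from ete3 import ClusterTree
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage
import logging


def newick_to_linkage(newick: str, label_order: [str] = None) -> (np.ndarray, [str]):
    """
    Convert newick tree into scipy linkage matrix

    :param newick: newick string, e.g. '(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
    :param label_order: list of labels, e.g. ['A', 'B', 'C']
    :returns: linkage matrix and list of labels
    """
    # newick string -> cophenetic_matrix
    tree = ClusterTree(newick)
    cophenetic_matrix, newick_labels = tree.cophenetic_matrix()
    cophenetic_matrix = pd.DataFrame(cophenetic_matrix, columns=newick_labels, index=newick_labels)

    if label_order is not None:
        # sanity checks
        missing_labels = set(label_order).difference(set(newick_labels))
        superfluous_labels = set(newick_labels).difference(set(label_order))
        assert len(missing_labels) == 0, f'Some labels are not in the newick string: {missing_labels}'
        if len(superfluous_labels) > 0:
            logging.warning(f'Newick string contains unused labels: {superfluous_labels}')

        # reorder the cophenetic_matrix
        cophenetic_matrix = cophenetic_matrix.reindex(index=label_order, columns=label_order)

    # condensed pairwise distances for linkage()
    # NOTE: pdist treats each row of the matrix as a feature vector and returns
    # Euclidean distances between rows, so the merge heights are not the tree
    # distances; scipy.spatial.distance.squareform is the call that condenses
    # a square distance matrix as-is.
    pairwise_distances = pdist(cophenetic_matrix)

    # return linkage matrix and labels
    return linkage(pairwise_distances), list(cophenetic_matrix.columns)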

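Basic usage:

>>> linkage_matrix, labels = newick_to_linkage(
...     newick='(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);'
... )
>>> print(linkage_matrix)
[[0.        1.        0.4472136 2.       ]
 [2.        3.        1.        2.       ]
 [4.        5.        1.4832397 4.       ]]
>>> print(labels)
['A', 'B', 'C', 'D']

What the cophenetic matrix looks like: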

>>> print(cophenetic_matrix)
     A    B    C    D
A  0.0  0.3  0.9  1.0
B  0.3  0.0  1.0  1.1
C  0.9  1.0  0.0  0.7
D  1.0  1.1  0.7  0.0
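Since the cophenetic matrix is already a square matrix of leaf-to-leaf distances, it may be worth comparing two ways of handing it to scipy. The sketch below types in the matrix printed above by hand (so it does not need ete3) and contrasts `squareform`, which condenses the matrix as-is, with `pdist`, which measures Euclidean distance between its rows. The variable names and the choice of `method="average"` for the first call are assumptions made for this comparison.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Cophenetic matrix for '(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);', copied from the output above
cophenetic = np.array([
    [0.0, 0.3, 0.9, 1.0],
    [0.3, 0.0, 1.0, 1.1],
    [0.9, 1.0, 0.0, 0.7],
    [1.0, 1.1, 0.7, 0.0],
])

# squareform keeps the tree distances: it just flattens the upper triangle.
condensed = squareform(cophenetic)
Z_tree = linkage(condensed, method="average")

# pdist treats each row as a feature vector, so the heights become
# Euclidean distances between rows rather than distances along the tree.
Z_rows = linkage(pdist(cophenetic))

print("squareform + average linkage:")
print(Z_tree)
print("pdist + default linkage:")
print(Z_rows)
```

The merge order is the same in both cases for this tree, but only the first variant keeps the heights on the scale of the branch lengths; the second reproduces the 0.447, 1.0 and 1.483 heights seen in the usage output.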