Python: probability matrix for a protein sequence

I am trying to build a probability matrix for this sequence:

"DITCCGQFHFAIIYHDWQYKIFRYAATSPVKEPWKHRMWYSIVAANDVENCNSFHGPYQQ KHQWQDNTAQYLEYKTIGYQKRDQPNNVWIHHPMVYYEPVHYRQFNDRQAFTYSDQFCSK SCTIIWNGEANQCHNKQTASDHTGWPRMFAYLKENYTQYSTFFICMLDKYTCSNMKSLPE MHWELMEWALMCSCEKERARYQCNSWRKSIADPEFNYCIAWMFCKHEEKGEETRCEQKHQ ALLPPHEDYGDSLNDCQVNNGEPYTTKGEQRVKLQKEGHKNEQCRKATKRKYQASQCEAK REMMKNWRSYTATESNARVMQHWRQWRLHSMCVITDDHTQRRETCEAKENRMLRTALHIW VVWASHWFPVMNITQIWTGEDHGDHNSFLALCDSVVASYRILEQQLECCPNEDQCPMSIF HYKVKMCWEWRIVYAPNQSHTRNCALDFKKMEPIAGMMHCPGMQSGMLTSDRPVLEPGSV ENPLFDNHVRFSYFFEQVNNGKFMLECSTCGDNEEIFGYHCIVQNYQDCASAKSAIFCFM FANQHAERGWSPGLIVRNF"

It is a protein, written as a sequence of amino acids.

The alphabet is:

alphabet = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

I created an empty matrix:

prob_matrix = {}
for i in alphabet:
    prob_matrix[i] = {}
    for j in alphabet:
        prob_matrix[i][j] = 0.0
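But I am struggling to fill this matrix with numbers based on my sequence. Can anyone help me with that step?

After that I can convert the counts to probabilities with this code: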

for row in prob_matrix:
   total = sum([prob_matrix[row][column] for column in prob_matrix[row]])
   if total > 0:
       for column in prob_matrix[row]:
           prob_matrix[row][column] /= total
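if that is correct.


Can anyone help me fill in this middle step, or help me write a completely new approach?

I would probably start by removing the whitespace from the sequence string: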

sequence = "DITCCGQFHFAIIYHDWQYKIFRYAATSPVKEPWKHRMWYSIVAANDVENCNSFHGPYQQ KHQWQDNTAQYLEYKTIGYQKRDQPNNVWIHHPMVYYEPVHYRQFNDRQAFTYSDQFCSK SCTIIWNGEANQCHNKQTASDHTGWPRMFAYLKENYTQYSTFFICMLDKYTCSNMKSLPE MHWELMEWALMCSCEKERARYQCNSWRKSIADPEFNYCIAWMFCKHEEKGEETRCEQKHQ ALLPPHEDYGDSLNDCQVNNGEPYTTKGEQRVKLQKEGHKNEQCRKATKRKYQASQCEAK REMMKNWRSYTATESNARVMQHWRQWRLHSMCVITDDHTQRRETCEAKENRMLRTALHIW VVWASHWFPVMNITQIWTGEDHGDHNSFLALCDSVVASYRILEQQLECCPNEDQCPMSIF HYKVKMCWEWRIVYAPNQSHTRNCALDFKKMEPIAGMMHCPGMQSGMLTSDRPVLEPGSV ENPLFDNHVRFSYFFEQVNNGKFMLECSTCGDNEEIFGYHCIVQNYQDCASAKSAIFCFM FANQHAERGWSPGLIVRNF"

stripped_sequence = sequence.replace(" ", "")
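Then count how many times each letter occurs in the sequence, keeping the counts in totals (indexed by letter).

Now you can get the probabilities by dividing each letter's count by the total number of letters in the sequence: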
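The counting step itself is not shown here. A minimal sketch of one way to do it, assuming totals is meant to be a plain dict from letter to count; the helper name count_letters and the short demo input are hypothetical, chosen only for illustration:

```python
from collections import Counter

def count_letters(seq, alphabet):
    """Return {letter: count} for every letter in alphabet (0 if absent)."""
    counts = Counter(seq)  # tallies every character in one pass
    # Indexing a Counter with a missing key gives 0, so absent letters are safe
    return {letter: counts[letter] for letter in alphabet}

# Hypothetical short input, used only to keep the example small
demo_alphabet = ["A", "C", "D", "F"]
demo_totals = count_letters("DITCCGQFHF", demo_alphabet)
print(demo_totals)  # {'A': 0, 'C': 2, 'D': 1, 'F': 2}
```

With the data from this answer, the equivalent call would be totals = count_letters(stripped_sequence, alphabet).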

sequence_len = len(stripped_sequence)
probabilities = {
    letter: totals[letter] / sequence_len for letter in alphabet
}

print(probabilities)
This prints the following probabilities:

{'A': 0.059033989266547404, 'C': 0.055456171735241505, 'D': 0.04293381037567084, 'E': 0.07334525939177101, 'F': 0.046511627906976744, 'G': 0.03756708407871199, 'H': 0.05008944543828265, 'I': 0.04293381037567084, 'K': 0.057245080500894455, 'L': 0.04114490161001789, 'M': 0.04293381037567084, 'N': 0.06082289803220036, 'P': 0.03935599284436494, 'Q': 0.06618962432915922, 'R': 0.0518783542039356, 'S': 0.057245080500894455, 'T': 0.046511627906976744, 'V': 0.04114490161001789, 'W': 0.03756708407871199, 'Y': 0.05008944543828265}

These should sum to approximately 1.
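The sum only reaches 1 when every character of the sequence is in the alphabet, so checking it is a cheap way to catch stray characters. A small sketch of such a check; letter_probabilities and the demo inputs are hypothetical names, not part of the answer above:

```python
import math

def letter_probabilities(seq, alphabet):
    """Fraction of seq made up by each letter of alphabet."""
    seq_len = len(seq)
    return {letter: seq.count(letter) / seq_len for letter in alphabet}

demo_alphabet = ["A", "C", "D"]

# Every character is in the alphabet, so the probabilities sum to 1
clean = letter_probabilities("ACCA", demo_alphabet)
print(math.isclose(sum(clean.values()), 1.0))  # True

# The lowercase 'c' is not in the alphabet, so part of the mass is missing
noisy = letter_probabilities("ACcA", demo_alphabet)
print(sum(noisy.values()))  # 0.75
```

Comparing with math.isclose rather than == avoids false alarms from floating-point rounding on longer sequences.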

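This should get the transition frequencies, which can then be converted to probabilities:

for i, j in zip(stripped_sequence[:-1], stripped_sequence[1:]):
    prob_matrix[i][j] += 1

zip(stripped_sequence[:-1], stripped_sequence[1:]) yields the pairs of amino acids that represent transitions, e.g. ('D', 'I'), ('I', 'T'), and so on. It works by pairing each amino acid in the sequence without its last element with the corresponding amino acid in the sequence without its first element. The sequence with the spaces removed has to be used here, because a space is not a key in prob_matrix and would raise a KeyError.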
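Putting the counting loop together with the row normalisation from the question gives something like the following. This is a sketch under the assumption that each row should be divided by its own total; transition_matrix and the two-letter demo alphabet are hypothetical:

```python
def transition_matrix(seq, alphabet):
    """Return matrix[a][b] = estimated probability that b follows a in seq."""
    # Start with a zero count for every ordered pair of letters
    matrix = {a: {b: 0.0 for b in alphabet} for a in alphabet}

    # Count each adjacent pair (seq[k], seq[k + 1])
    for i, j in zip(seq[:-1], seq[1:]):
        matrix[i][j] += 1

    # Divide each row by its total so that the row sums to 1
    for row in matrix:
        total = sum(matrix[row].values())
        if total > 0:  # letters that never start a pair keep an all-zero row
            for column in matrix[row]:
                matrix[row][column] /= total
    return matrix

# Hypothetical toy input: the pairs are AA, AC, CA
demo = transition_matrix("AACA", ["A", "C"])
print(demo["A"])  # {'A': 0.5, 'C': 0.5}
print(demo["C"])  # {'A': 1.0, 'C': 0.0}
```

Rows are normalised independently, so each non-empty row is a distribution over the next amino acid given the current one.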

What probabilities do you want? Transition probabilities?
@petercollingridge Yes.