Python 2.7: parse a file and compute specific values
I need to go through a .pdb file and find the frequency of each atom (H, N, O) in each protein. It should only read lines that start with "PDBF", not "BLAS". Sample file:
PDBF 772 CB ASP A 105 -10.000 19.025 13.019 1.00 21.14 H
PDBF 773 CG ASP A 105 -11.247 18.520 13.742 1.00 24.28 N
PDBF 774 OD1 ASP A 105 -12.349 18.587 13.155 1.00 25.15 N
PDBF 775 OD2 MET A 105 -11.130 18.069 14.908 1.00 24.03 N
PDBF 776 N MET A 106 -8.582 19.113 9.606 1.00 20.21 N
PDBF 777 CA MET A 106 -7.426 19.662 8.918 1.00 18.92 H
PDBF 778 C MET A 106 -7.780 20.808 7.987 1.00 18.96 H
PDBF 779 O MET A 106 -7.021 21.768 7.855 1.00 18.52 O
PDBF 780 CB ARG A 106 -6.741 18.559 8.125 1.00 19.39 O
PDBF 781 CG ARG A 106 -6.037 17.540 8.980 1.00 18.88 N
BLAS 782 CG ARG A 106 -9.057 17.540 1.280 1.00 19.23 N
BLAS 783 CG ARG A 106 -8.015 15.920 3.970 1.00 11.81 H
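In total there are 3 H, 5 N and 2 O.
To find the frequency, I take the count of each atom in each protein and divide it by the total count of that atom (over the whole list).
For example:
ASP has a frequency of 0.33 for H (1/3), 0.40 for N (2/5) and 0.00 for O (0/2).
The resulting output should be: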
        H     N     O
ASP     0.33  0.40  0.00
MET     0.66  0.40  0.50
ARG     0.00  0.20  0.50
Total:  3     5     2
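(The result should print a chart of the atom frequencies for each protein, and it should also include the atom totals.)
I can't do a tab-delimited search because it doesn't work on this file, so I have to slice each line (line[0:77]) to get at the last value on the line (the atom).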
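As an aside on the splitting problem: the sample lines shown here are separated by runs of whitespace rather than tabs, so calling split() with no argument may be enough. The following is only a minimal sketch under that assumption; parse_record is a hypothetical helper name, and real PDB files use fixed columns in which fields can run together, so slicing may still be needed there.

```python
from __future__ import print_function

# One record copied from the sample file in the question.
line = "PDBF 772 CB ASP A 105 -10.000 19.025 13.019 1.00 21.14 H"


def parse_record(line):
    """Return (tag, protein, atom) for a whitespace-delimited record.

    Hypothetical helper: assumes the tag is the first field, the protein
    name the fourth, and the atom the last. Returns None for blank lines.
    """
    fields = line.split()  # no argument: splits on any run of spaces or tabs
    if not fields:
        return None
    return fields[0], fields[3], fields[-1]


print(parse_record(line))
```

Because split() without arguments treats tabs and repeated spaces alike, the same helper works whether the file turns out to be tab- or space-delimited.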
My idea for how to do this:
- A = Count all the atoms (total for the entire list)
- B = Count the total number of each type of atom for each protein
- B divided by A = the frequency for each atom
- Assign that frequency to each protein
- Display the total of each atom (A) at the end of the chart
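The steps above can be sketched roughly as follows. This is one possible reading of the plan, not a definitive implementation: atom_frequencies is a hypothetical function name, the records are assumed to be whitespace-delimited as in the sample, and the sample list is shortened to four lines to keep the example small.

```python
from __future__ import division, print_function
from collections import Counter


def atom_frequencies(lines):
    """Return (freq, totals) where freq[(protein, atom)] = count / totals[atom]."""
    per_protein = Counter()  # step B: counts keyed by (protein, atom)
    totals = Counter()       # step A: counts keyed by atom over the whole list
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "PDBF":
            continue  # skip blank lines and BLAS records
        protein, atom = fields[3], fields[-1]
        per_protein[(protein, atom)] += 1
        totals[atom] += 1
    # B divided by A, stored per (protein, atom) pair.
    freq = {key: n / totals[key[1]] for key, n in per_protein.items()}
    return freq, totals


# Shortened sample: three PDBF records and one BLAS record.
sample = [
    "PDBF 772 CB ASP A 105 -10.000 19.025 13.019 1.00 21.14 H",
    "PDBF 773 CG ASP A 105 -11.247 18.520 13.742 1.00 24.28 N",
    "PDBF 776 N MET A 106 -8.582 19.113 9.606 1.00 20.21 N",
    "BLAS 782 CG ARG A 106 -9.057 17.540 1.280 1.00 19.23 N",
]
freq, totals = atom_frequencies(sample)
print(freq[("ASP", "N")])  # 0.5: ASP holds one of the two N atoms in this sample
```

Pairs that never occur (such as ARG here, which only appears on a BLAS line) are simply absent from freq, so a caller printing a chart would treat a missing key as 0.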
The code I have so far is:
#!/usr/bin/python
import re
import io
import csv

def read_pdb(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith("PDBF"):
            lineSplit = line.split(' ')
            name = lineSplit[1]
            if name: yield (name, ''.join(seq))
            name, seq = line, []

proteins = ['ASP', 'MET', 'ARG']
atoms = ['N', 'H', 'O']
temp = 0
with open('protein.pdb') as fp:
    for proteins in read_pdb(fp):
        #print(freq)
        for a in atoms:
            if re.findall(a,fp):
                temp += 1
                a = temp
Thanks in advance for any help you can provide.

OK, I had to reinterpret everything, sorry for the wait. Here you go:
def frequencydict(thefile):
    # First pass: count atoms per protein. The slices rely on the fixed-column
    # PDB layout (residue name in columns 18-20, element symbol in column 78).
    proteins = {}
    for line in thefile:
        if line[0:4] == 'PDBF':
            protein = line[17:20]
            atom = line[77:78]
            if protein not in proteins:
                proteins[protein] = {}
                proteins[protein][atom] = 1
            else:
                if atom not in proteins[protein]:
                    proteins[protein][atom] = 1
                else:
                    proteins[protein][atom] += 1
    # Second pass: collect every atom on the PDBF lines to get the totals.
    atomlist = []
    thefile.seek(0)
    for line in thefile:
        if line[0:4] == 'PDBF':
            atomlist.append(line[77:78])
    htotal = atomlist.count('H')
    ntotal = atomlist.count('N')
    ototal = atomlist.count('O')
    # Frequency = atoms of this type in the protein / total atoms of this type.
    proteinnames = list(proteins.keys())
    for name in proteinnames:
        proteins[name]['freq'] = {}
        if 'H' in proteins[name]:
            proteins[name]['freq']['H'] = proteins[name]['H'] / float(htotal)
        else:
            proteins[name]['freq']['H'] = 0.00
        if 'N' in proteins[name]:
            proteins[name]['freq']['N'] = proteins[name]['N'] / float(ntotal)
        else:
            proteins[name]['freq']['N'] = 0.00
        if 'O' in proteins[name]:
            proteins[name]['freq']['O'] = proteins[name]['O'] / float(ototal)
        else:
            proteins[name]['freq']['O'] = 0.00
    print('     H    N    O')
    for name in proteinnames:
        print('%s %.2f %.2f %.2f' % (name, proteins[name]['freq']['H'],
                                     proteins[name]['freq']['N'],
                                     proteins[name]['freq']['O']))
    print('Total: %d %d %d' % (htotal, ntotal, ototal))
    return proteins
This works for any number of proteins and atoms:
from __future__ import division
from collections import defaultdict
import sys

data = '''\
PDBF 772 CB ASP A 105 -10.000 19.025 13.019 1.00 21.14 H
PDBF 773 CG ASP A 105 -11.247 18.520 13.742 1.00 24.28 N
PDBF 774 OD1 ASP A 105 -12.349 18.587 13.155 1.00 25.15 N
PDBF 775 OD2 MET A 105 -11.130 18.069 14.908 1.00 24.03 N
PDBF 776 N MET A 106 -8.582 19.113 9.606 1.00 20.21 N
PDBF 777 CA MET A 106 -7.426 19.662 8.918 1.00 18.92 H
PDBF 778 C MET A 106 -7.780 20.808 7.987 1.00 18.96 H
PDBF 779 O MET A 106 -7.021 21.768 7.855 1.00 18.52 O
PDBF 780 CB ARG A 106 -6.741 18.559 8.125 1.00 19.39 O
PDBF 781 CG ARG A 106 -6.037 17.540 8.980 1.00 18.88 N
BLAS 782 CG ARG A 106 -9.057 17.540 1.280 1.00 19.23 N
BLAS 783 CG ARG A 106 -8.015 15.920 3.970 1.00 11.81 H
'''.splitlines()

# defaultdicts used to simplify initial entries in dicts.
D = defaultdict(lambda:defaultdict(int))
T = defaultdict(int)

# data is whitespace-delimited, so a simple split() works.
for line in data:
    tag,_,_,prot,_,_,_,_,_,_,_,atom = line.split()
    if tag == 'PDBF':
        D[prot][atom] += 1 # atoms per protein
        T[atom] += 1 # totals per atom

# header
print((' '+'{:^4} '*len(T)).format(*sorted(T)))
for prot in sorted(D):
    sys.stdout.write('{:3} '.format(prot))
    for atom in sorted(T):
        sys.stdout.write(' {:4.2f}'.format(D[prot][atom]/T[atom]))
    sys.stdout.write('\n')
sys.stdout.write('Total:')
for atom in sorted(T):
    sys.stdout.write('{:3} '.format(T[atom]))
sys.stdout.write('\n')
Output:
H N O
ARG 0.00 0.20 0.50
ASP 0.33 0.40 0.00
MET 0.67 0.40 0.50
Total: 3 5 2
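I just tried it out, but every line in the file has the same numbers... Is each line supposed to have its own frequency dictionary? For instance, should the example at the top give 10 different results? I read it as those 10 lines being what is used to get the frequency of each atom.
I think I may be misunderstanding the math behind these frequencies. If you can elaborate on the math, I think I can help.
For H: the frequency is the number of H in ASP divided by the total number of H in the whole list (so it is 1/3 = 0.33). There should be only one frequency per atom per protein, so with 3 proteins and 3 atoms there should be 9 frequency values in total.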
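To make the arithmetic in the comment above concrete, here is a small worked sketch. The counts are tallied by hand from the ten PDBF lines in the question (the BLAS lines are ignored), and the variable names are illustrative only. Fraction is used so that the exact ratios such as 1/3 stay visible before rounding.

```python
from __future__ import print_function
from fractions import Fraction

# Atom counts per protein, tallied by hand from the PDBF lines in the question.
counts = {
    "ASP": {"H": 1, "N": 2, "O": 0},
    "MET": {"H": 2, "N": 2, "O": 1},
    "ARG": {"H": 0, "N": 1, "O": 1},
}
atoms = ["H", "N", "O"]

# Totals per atom over the whole list: H = 3, N = 5, O = 2.
totals = {a: sum(row[a] for row in counts.values()) for a in atoms}

# One frequency per (protein, atom) pair: 3 proteins x 3 atoms = 9 values.
freqs = {(p, a): Fraction(counts[p][a], totals[a]) for p in counts for a in atoms}

# Print one row per protein, rounded to two decimals.
for p in ["ASP", "MET", "ARG"]:
    print(p, " ".join("%.2f" % float(freqs[(p, a)]) for a in atoms))
```

A quick sanity check on this definition is that the frequencies for any one atom add up to 1 across the proteins, since every counted atom belongs to exactly one protein.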
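seq always seems to be an empty list, so why bother with ''.join(seq) in the yield statement?
Without looking at the file, do you know that there are only H, N and O atoms? Could there be others that you won't know about until the code has read them?
@wwii There are only H, N and O atoms.
Please update your question: what happens when you run it? What specific question are you asking?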