Python 2.7: find a string and its location in a file


I want to do three things:

1) Print out the ID for each sequence
2) Find a particular motif in a sequence, print it out if it exists
3) Print out the index location for the motif in the sequence

Example Sequence.fasta file:

>sp|Q12955|ANK3_HUMAN Ankyrin-3 OS=Homo sapiens GN=ANK3 PE=1 SV=3
MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY
IKNGVDINICNQNGLNALHLASKEGHVEVVSELLQREANVDAATKKGNTALHIASLAGQA

>sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6 OS=Homo sapiens GN=MAPK6 PE=1 SV=1
MAEKFESLMNIHGFDLGSRYMDLKPLGCGGNGLVFSAVDNDCDKRVAIKKIVLTDPQSVK
HALREIKIIRRLDHDNIVKVFEILGPSGSQLTDDVGSLTELNSVYIVQEYMETDLANVLE
QGPLLEEHARLFMYQLLRGLKYIHSANVLHRDLKPANLFINTEDLVLKIGDFGLARIMDP

>sp|Q7Z7A1|CNTRL_HUMAN Centriolin OS=Homo sapiens GN=CNTRL PE=1 SV=2
MKKGSQQKIFKHLQQPSSSHSPIPSSMSNMRSRSLSPLIGSETLPFHSGGQWCEQVEIAD
ENNMLLDYQDHKGADSHAGVRYITEALIKKLTKQDNLALIKSLNLSLSKDGGKKFKYIEN
LEKCVKLEVLNLSYNLIGKIEKLDKLLKLRELNLSYNKISKIEGIENMCNLQKLNLAGNE

In this file I want to find the following motifs. The same motif can occur more than once in a sequence. For example:

MAH..S
KK..D
FES.MN
K..QQ

So the output should be:

ID = Q12955
Motif = MAH..S
Location =[0] to [4]
Motif = KK..D
Location = [8] to [12]

ID = Q16659
Motif = FES.MN
Location = [4] to [9]

ID = Q7Z7A1
Motif = K..QQ
Location = [1] to [6]
Location = [10] to [14]

My code so far:

To find the ID:

f=open('pr_seq.fasta','r')

for idLine in f:
    if '>' in idLine:
        lineSplit = idLine.split('|')
        ID = lineSplit[1]
        print ID

To find the motif in the sequence:

f = open('pr_seq.fasta', 'r')
pr = ''

for motLine in f:
    if motLine[0] == '>':
        # header line: start a fresh sequence for the new record
        pr = ''
    else:
        # sequence line: append it without the trailing newline
        pr += motLine.strip()

    print ("PROTEIN SEQUENCE")
    print
    print (pr)
    print
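The loop above accumulates the sequence text but loses track of which ID it belongs to. A minimal sketch of one way to keep the two together follows. It assumes UniProt-style headers (`>sp|ID|NAME ...`) as in the example file; `read_fasta` is a hypothetical helper name, not something from the question. It also joins wrapped sequence lines, in case a file does contain them. The single-argument `print(...)` calls are written to behave the same under Python 2.7 and Python 3.

```python
def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                          # skip blank separator lines
        if line.startswith('>'):
            if seq_id is not None:
                yield seq_id, ''.join(chunks)  # emit the previous record
            parts = line.split('|')
            if len(parts) > 1:
                seq_id = parts[1]             # >sp|Q12955|ANK3_HUMAN -> Q12955
            else:
                # fallback for headers without '|': first word after '>'
                seq_id = (line[1:].split() or [''])[0]
            chunks = []
        elif seq_id is not None:
            chunks.append(line)               # sequence may span several lines
    if seq_id is not None:
        yield seq_id, ''.join(chunks)         # emit the last record


# Shortened, made-up excerpt in the same layout as the example file.
example = [
    '>sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6\n',
    'MAEKFESLMNIHGFDLGSRY\n',
    'HALREIKIIRRLDHDNIVKV\n',
    '\n',
    '>sp|Q7Z7A1|CNTRL_HUMAN Centriolin\n',
    'MKKGSQQKIFKHLQQPSSSH\n',
]
records = list(read_fasta(example))
for seq_id, seq in records:
    print('%s %s' % (seq_id, seq))
```

With a real file, the same generator could be fed an open file object, for example `read_fasta(open('pr_seq.fasta'))`.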

To find the index location of the motif:

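import re

motif = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'F.FP']
indices = len(pr)
index = 0

for a in motif:
    if re.findall(a, pr):
        print a
        # str.index() looks for the literal text (dots included), not the
        # regex, so this raises ValueError instead of returning a position
        mi = pr.index(a)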
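For the position lookup, one option is to let the regex engine report the offsets itself: each match object returned by `re.finditer` has `start()` and `end()` methods. The sketch below is illustrative only. It reuses the first line of the Q12955 example as `pr`, and the short `motifs` list is chosen here just to show one motif with a single hit, one with two hits, and one with none.

```python
import re

# First sequence line of the Q12955 record from the example file.
pr = 'MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY'
motifs = ['MAH..S', 'KK..D', 'F.FP']

hits = []
for pattern in motifs:
    for m in re.finditer(pattern, pr):
        # m.start() is the 0-based index of the first matched residue;
        # m.end() is one past the last one, so subtract 1 for an inclusive end.
        hits.append((pattern, m.group(), m.start(), m.end() - 1))

for pattern, text, start, end in hits:
    print('%s matched %s at [%d] to [%d]' % (pattern, text, start, end))
```

Note that `re.finditer` returns non-overlapping matches; if overlapping occurrences matter, a lookahead pattern such as `(?=(KK..D))` is a common workaround.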

Since you have explained that there are no line breaks within a sequence, just use grep:

grep MAH..S Sequence.fasta | grep -bo MAH..S
0:MAHAAS

grep KK..D Sequence.fasta | grep -bo KK..D
8:KKNRD
35:KKKSD

grep FES.MN Sequence.fasta | grep -bo FES.MN
4:FESLMN

grep K..QQ Sequence.fasta | grep -bo K..QQ
2:KGSQQ
10:KHLQQ

If searching for the pattern a second time is acceptable, you can also get the ID:

grep -B1 K..QQ Sequence.fasta | awk -F"|" 'NR==1{print $2}'
Q7Z7A1

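Getting the range is simple: add the length of the match to the start position (and subtract 1 if you want the index of the last matched character).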
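As a small worked example of that arithmetic, the sketch below turns one `offset:match` line, as printed by `grep -bo`, into an inclusive start/end pair. `match_range` is a hypothetical helper name. It assumes the offset really is the position within the sequence, which holds for the pipelines above only because the first `grep` passes on a single matching line.

```python
def match_range(grep_line):
    """Turn one 'offset:match' line from grep -bo into (start, end, text)."""
    offset, text = grep_line.split(':', 1)
    start = int(offset)
    end = start + len(text) - 1   # inclusive index of the last character
    return start, end, text


for line in ['8:KKNRD', '35:KKKSD']:
    start, end, text = match_range(line)
    print('Location = [%d] to [%d]' % (start, end))
```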

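Actually, use the re module instead of grep; I had not noticed that your question was tagged Python. The alternative would be to run grep through subprocess.call. In Python it would be: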

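import re

with open('Sequence.fasta') as f:
    lines = f.readlines()

for line in lines:
    # re.match only tests the start of the line; finditer scans the whole
    # line and yields every non-overlapping occurrence
    for m in re.finditer('MAH..S', line):
        # single formatted string so the output is the same in Python 2 and 3
        print('%d %s' % (m.start(), m.group()))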

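Getting the output into the right format is straightforward, so I leave that to you. This does not match across line breaks, but you said there are none.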
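For readers who want the formatting step spelled out, here is one possible sketch that produces the `ID` / `Motif` / `Location` layout requested in the question. It rests on two assumptions stated in the thread: headers look like `>db|ID|name`, and each sequence sits on a single line. `find_motifs` is a hypothetical helper name. Locations are 0-based and inclusive, so for `K..QQ` this reports `[2] to [6]`, in line with the grep output, where the question's sample lists `[1] to [6]`.

```python
import re


def find_motifs(lines, motifs):
    """Return report lines in the layout requested in the question."""
    out = []
    seq_id = None
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            seq_id = line.split('|')[1]    # assumes a >db|ID|name header
            continue
        if not line or seq_id is None:
            continue                       # blank line, or text before a header
        block = []
        for motif in motifs:
            # inclusive (start, end) for every non-overlapping occurrence
            spans = [(m.start(), m.end() - 1)
                     for m in re.finditer(motif, line)]
            if spans:
                block.append('Motif = %s' % motif)
                block.extend('Location = [%d] to [%d]' % s for s in spans)
        if block:                          # print the ID only if something hit
            out.append('ID = %s' % seq_id)
            out.extend(block)
    return out


# Header and first sequence line of the Q7Z7A1 record from the example file.
sample = [
    '>sp|Q7Z7A1|CNTRL_HUMAN Centriolin OS=Homo sapiens GN=CNTRL PE=1 SV=2',
    'MKKGSQQKIFKHLQQPSSSHSPIPSSMSNMRSRSLSPLIGSETLPFHSGGQWCEQVEIAD',
]
report = find_motifs(sample, ['MAH..S', 'KK..D', 'FES.MN', 'K..QQ'])
print('\n'.join(report))
```

If a file does wrap sequences over several lines, the lines of each record would need to be joined before searching; otherwise motifs that straddle a line break are missed.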

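I think it would help if you simplified the example. Are line breaks treated as whitespace? If not, you can simply use grep. Otherwise, I would implement the Knuth-Morris-Pratt algorithm myself.

There are no spaces or line breaks in the sequence.fasta file.

Then why did you add line breaks in the example?

So that readers can see what the sequences in the file look like. Each sequence starts with a > sign, and the next line is the actual sequence. Here is a link to the fasta file:

Yes, it is Python. I used re for the motif search, but I could not get the position using index.

I added a Python solution.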