Python regex findall with lookahead


python regex


I wrote a function that searches a protein string for open reading frames (ORFs). Here it is:

import re

from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML


def orf_finder(seq, format):
    record = SeqIO.read(seq, format)  # Reads in the sequence and tells Biopython what format it is.
    string = []  # creates an empty list

    for i in range(3):
        string.append(record.seq[i:])  # collects the three reading frames (offsets 0, 1 and 2).

    protein_string = []  # creates an empty list
    protein_string.append([str(i.translate()) for i in string])  # translates each reading frame in 'string'
    regex = re.compile('M''[A-Z]'+r'*')  # intended pattern: methionine, followed by any amino acids and ending with a stop codon.
    res = max(regex.findall(str(protein_string)), key=len)  # res is the longest translated ORF found in the sequence.
    print("The longest ORF (translated) is:\n\n", res, "\n")
    print("The first blast result for this protein is:\n")

    blast_records = NCBIXML.parse(NCBIWWW.qblast("blastp", "nr", res))  # blasts the sequence and puts the results into a 'record object'.
    blast_record = next(blast_records)

    counter = 0  # used to print only the first blast record; once it is printed the counter equals 1 and nothing more is output.
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if counter < 1:  # mechanism for stopping the output
                print('Sequence:', alignment.title)
                print('Length:', alignment.length)
                print('E value:', hsp.expect)
                print('Query:', hsp.query[0:])
                print('Match:', hsp.match[0:])
                counter = 1

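The only problem is that I don't think my regular expression, re.compile('M''[A-Z]'+r'*'), finds overlapping matches. I know that a lookahead clause, ?=, might solve my problem, but I can't seem to implement it without getting an error.

Does anyone know how I can get this to work?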

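The code above uses Biopython to read a DNA sequence, translate it, and then search for a protein reading frame: a sequence that starts with "M" and ends with "*".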

re.compile(r"M[A-Z]+\*")

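This assumes that the string you are searching for starts with "M", is followed by one or more uppercase letters "A-Z", and ends with a literal "*".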
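As a quick check of what this pattern does with findall, the sketch below runs it on a made-up protein string (the sample input and the variable names are illustrative assumptions, not taken from the post). re.findall reports only non-overlapping matches; wrapping the pattern in a lookahead with a capture group, (?=(...)), is one common way to collect a candidate starting at every "M".

```python
import re

# Hypothetical translated sequence, chosen only for illustration.
protein = "MKLMPQ*RST*"

# Plain pattern: findall reports non-overlapping matches only.
plain = re.compile(r"M[A-Z]+\*")
plain_matches = plain.findall(protein)
print(plain_matches)

# Lookahead with a capture group: the match itself is zero-width,
# so the scan advances one character at a time and findall returns
# the captured text for every "M" that can start a match.
overlapping = re.compile(r"(?=(M[A-Z]+\*))")
candidates = overlapping.findall(protein)
print(candidates)

# Longest candidate, mirroring the question's max(..., key=len).
longest = max(candidates, key=len)
print(longest)
```

With the lookahead version, the shorter candidate that starts at the second "M" shows up alongside the full one, which the plain pattern skips.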

Your code is badly indented. Try re-pasting it using

cat foo.py | sed -e "s/^/    /g" | pbcopy

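Most of the code is irrelevant to the question. Please give some example input strings and the matches you want to get from them.

The code takes in a DNA sequence such as "ACTGGAACCTGTTTA" and translates it into a protein sequence such as "MWDVIRSYPWHPTQMMGAPERGED*". I am looking for the longest string in the protein sequence that starts with the letter "M" and ends with an asterisk "*". The sequence may also contain "M". In short, I am looking for the longest distance between an "M" and an asterisk.

Only the longest string? You mentioned wanting to find overlapping matches. Please clarify. (What if there are multiple Ms and multiple asterisks? Do you want multiple results, and if so which ones?)

I want to find all instances, overlapping and non-overlapping, so that I don't miss anything. Then, from that list, I will take the maximum.

I would like to know what the input

MABCMDEF*GHI*

should produce. On the one hand the OP says "looking for the longest string that starts with M and ends with *", and on the other "the sequence may also contain M" (so it may not contain an asterisk?).

No, it may not contain an asterisk. Anything except an asterisk can appear inside the sequence.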
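To make the answer to that question concrete, here is a minimal sketch that applies the rule stated in the comments (start at "M", end at the first "*", no asterisk in between) to the input MABCMDEF*GHI*. The function name longest_orf and the idea of passing already-translated frames as plain strings are assumptions made for illustration; Biopython is left out so that the snippet runs on its own.

```python
import re

# "M", then any characters except "*", then the closing "*".
# The lookahead makes each match zero-width, so findall yields
# one candidate for every "M" that is followed by a "*".
ORF_PATTERN = re.compile(r"(?=(M[^*]*\*))")


def longest_orf(frames):
    """Return the longest M...* candidate across the given strings, or None."""
    candidates = []
    for frame in frames:
        candidates.extend(ORF_PATTERN.findall(frame))
    return max(candidates, key=len) if candidates else None


all_candidates = ORF_PATTERN.findall("MABCMDEF*GHI*")
print(all_candidates)
print(longest_orf(["MABCMDEF*GHI*"]))
```

Under this rule, every candidate that starts at a later "M" in the same stretch is a suffix of the one that starts at the first "M". The overlapping search therefore matters only if the shorter candidates are wanted as well; the longest one would also be found by a plain, non-overlapping findall.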