正则表达式Python变量
我有这样的数据:正则表达式Python变量,python,regex,bioinformatics,Python,Regex,Bioinformatics,我有这样的数据: >Px016979 MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS >Px016980 MQFIKKVLLIALTLSGAM
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
我想做的是,如果我的号码>Px016979,或者我可以得到它下面的数据。如下所示:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
我是Python新手
#coding:utf-8
import os,re
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
b = '>Px016979'
matchbj = re.match( r'$b(.*?)>',a,re.M|re.I)
print matchbj.group()
我的代码不能工作。我有两个问题:
re.match(r'>Px016797(.*?>),a,re.M | re.I)
它可以工作,但我需要使用变量谢谢。以下内容适用于您的每个条目:
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
for b in ['>Px016979', '>Px016980', '>Px002185', '>Px006321']:
re_search = re.search(re.escape(b) + r'(.*?)(?:>|\Z)', a, re.M|re.I|re.S)
print re_search.group()
这将显示以下内容:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
以下内容适用于您的每个条目:
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
for b in ['>Px016979', '>Px016980', '>Px002185', '>Px006321']:
re_search = re.search(re.escape(b) + r'(.*?)(?:>|\Z)', a, re.M|re.I|re.S)
print re_search.group()
这将显示以下内容:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
看起来您的数据是一个包含蛋白质序列的FASTA文件。因此,不要使用正则表达式,应该考虑安装。这是一个专门用于生物信息学应用和研究的图书馆 Biopython的目标是通过创建高质量、可重用的模块和类,使Python尽可能容易地用于生物信息学。Biopython功能包括用于各种生物信息学文件格式(BLAST、CLUSTAW、FASTA、Genbank等)的解析器、在线服务访问(NCBI、Expasy等)、通用和非通用程序接口(Clustalw、DSSP、MSMS等)、标准序列类、各种聚类模块、KD树数据结构等,甚至文档 使用BioPython,您可以通过以下方式从FASTA文件中提取给定标识符的序列:
from Bio import SeqIO
input_file = r'C:\path\to\proteins.fasta'
record_id = 'Px016979'
record_dict = SeqIO.to_dict(SeqIO.parse(input_file, 'fasta'))
record = record_dict[record_id]
sequence = str(record.seq)
print sequence
看起来您的数据是一个包含蛋白质序列的FASTA文件。因此,不要使用正则表达式,应该考虑安装。这是一个专门用于生物信息学应用和研究的图书馆 Biopython的目标是通过创建高质量、可重用的模块和类,使Python尽可能容易地用于生物信息学。Biopython功能包括用于各种生物信息学文件格式(BLAST、CLUSTAW、FASTA、Genbank等)的解析器、在线服务访问(NCBI、Expasy等)、通用和非通用程序接口(Clustalw、DSSP、MSMS等)、标准序列类、各种聚类模块、KD树数据结构等,甚至文档 使用BioPython,您可以通过以下方式从FASTA文件中提取给定标识符的序列:
from Bio import SeqIO
input_file = r'C:\path\to\proteins.fasta'
record_id = 'Px016979'
record_dict = SeqIO.to_dict(SeqIO.parse(input_file, 'fasta'))
record = record_dict[record_id]
sequence = str(record.seq)
print sequence
<>我也会考虑安装Pythython并检查Python的书,这是免费的生物学家。我曾与fastas合作过很多次,为了获得一个快速而肮脏的解决方案,您只需使用以下内容(保持代码的其余部分不变): 由于re.DOTALL标志,它基本上在多行上匹配,并查找'>'字符之间的任意数量的任何内容。
请注意,这将在列表中为您提供它们,而不是对象。在我的经验中,Re.Matt在人们学习的第一件事,但他们经常寻找Re.FunDALL的影响。 < P>我也会考虑安装Pythython,并检查Python的Python是免费的在线(生物学家)。我曾与fastas合作过很多次,为了获得一个快速而肮脏的解决方案,您只需使用以下内容(保持代码的其余部分不变): 由于re.DOTALL标志,它基本上在多行上匹配,并查找'>'字符之间的任意数量的任何内容。
请注意,这将在列表中为您提供它们,而不是对象。根据我的经验,re.match是人们学习的第一件事,但他们经常寻找re.findall的效果。正则表达式是实现这一点的正确工具吗?如果您从第一个实例中的文件中获取这些数据,那么我将依次考虑每个行,如果该行从>开始,则该行定义了一个键。如果没有,则该行是数据,我会将其附加到
d[key]
的值中(在开始时初始化d={}
,对于每个新键,d[key]=“”
)。构建dict后,您可以简单地将所查找的数据引用为d[key]
,或者如果键入d:,则使用检查它是否存在。正则表达式是否是正确的工具?如果您从第一个实例中的文件中获取这些数据,那么我将依次考虑每个行,如果该行从>开始,则该行定义了一个键。如果没有,则该行是数据,我会将其附加到d[key]
的值中(在开始时初始化d={}
,对于每个新键,d[key]=“”
)。构建dict后,您可以简单地将所查找的数据引用为d[key]
,或者如果键入d:
,则使用检查它是否存在。