Complex string filtering with Python

I have a very long string, which is a phylogenetic tree, and I want to apply a very specific filter to it:

(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically, each x@y is a species@gene_id entry. What I want to do is reduce this so that I only have x instead of x@y:

(Esy, Aar,(Spa,Cpl))...
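I first tried splitting the string, but the problem is that the string has different "split points" for what I want to achieve, i.e. some x@y parts end with "," and others end with ")". I searched for a solution and came across regular expressions, but I am new to Python and cannot tell whether that is what I should be looking at. I also considered strip(), but it seems I would need to specify the characters to strip.

The main problem is that I don't know which "pattern" to tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before the @ character.

Is there a way to do what I need? I would be very glad if you could help me solve my problem. Thanks in advance.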

How about a function like this:

def parse_string(string):
    new_string = ''
    skip = False
    for char in string:
        if char == '@':
            skip = True
        if char == ',':
            skip = False
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string
Calling it with the string:

string = '(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
Try this:

import re

pat = re.compile(r'(\w{3})@')
txt = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)
Result:

['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the complete structure, we can instead remove the unnecessary parts:

pat = re.compile(r'(@|:)[^),]*')
pat.sub('', txt).replace(',', ', ')
Result:

'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
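
As a small variation on the substitution above, the same idea can also keep the branch lengths and drop only the gene IDs. This is a minimal sketch on a shortened, made-up sample in the same x@y:length format; the variable names are illustrative only.

```python
import re

# Shortened sample in the same "species@gene_id:length" format (made up for illustration).
txt = "(Esy@ESY15_g64743:0.07,(Spa@Tp2g18720:0.03,Cpl@CP2_g48793:0.02):9.05e-05):0.01;"

# Drop only "@gene_id": from "@" up to (not including) the next ":", "," or ")".
names_with_lengths = re.sub(r"@[^:,)]*", "", txt)
print(names_with_lengths)  # (Esy:0.07,(Spa:0.03,Cpl:0.02):9.05e-05):0.01;

# Drop "@gene_id" and ":length" alike; the trailing ";" goes with the last length.
names_only = re.sub(r"[@:][^,)]*", "", txt)
print(names_only)  # (Esy,(Spa,Cpl))
```

Both patterns rely on gene IDs and branch lengths never containing "," or ")", which holds for the tree in the question but is an assumption for other inputs.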

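Since you are trying to parse a phylogenetic tree, I strongly recommend letting BioPython do the heavy lifting for you.

You can easily parse and display a tree with Bio.Phylo. After that it is just a matter of iterating over all tree elements and splitting the names at the "at" sign.

Because Phylo expects its input in a file, we use io.StringIO to create an in-memory file-like object. Getting the complete tree is then very easy:

tree = Phylo.read(io.StringIO(s), 'newick')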

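To check that the parsed tree looks fine, I print it once with print(tree).

Now we want to change all node names that contain "@". With tree.find_elements() we can access all nodes. Some nodes have no name, and some might not contain an "@". So, to be extra careful, we first check "if n.name and '@' in n.name". Only then do we split the node's name at '@' and keep just the first part (index 0):

n.name = n.name.split('@')[0]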

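To recreate the initial string representation, we use Phylo.write:

out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())

Again, write wants a file argument. Since we only want a string, we can once more use a StringIO object.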
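
To illustrate what "file-like object in memory" means here, the following minimal sketch uses only the standard library (no Biopython); the short Newick strings are made up for the example.

```python
import io

# Reading: wrap an existing string so it can be passed where a file is expected.
handle = io.StringIO("(A:0.1,B:0.2);")
first_read = handle.read()    # the whole string
second_read = handle.read()   # empty: the "file" position is now at the end

# Writing: collect whatever gets written into an in-memory buffer.
out = io.StringIO()
out.write("(A,B)")
out.write(";\n")
written = out.getvalue()      # everything written so far, as one string

print(repr(first_read), repr(second_read), repr(written))
```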

Full code:

import io
from Bio import Phylo

if __name__ == '__main__':
    s = '(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'

    # Phylo.read expects a file handle, so wrap the string in a StringIO
    tree = Phylo.read(io.StringIO(s), 'newick')

    print('before'.center(20, '='))
    print(tree)

    # keep only the part of each node name before the '@'
    for n in tree.find_elements():
        if n.name and '@' in n.name:
            n.name = n.name.split('@')[0]

    print('result'.center(20, '='))
    out = io.StringIO()
    Phylo.write(tree, out, "newick")
    print(out.getvalue())
Output:

====== before ======
Tree(rooted=False, weight=1.0)
    Clade(branch_length=0.0129090235079)
        Clade(branch_length=0.0726396855636, name='Esy@ESY15_g64743_DN3_SP7_c0')
        Clade(branch_length=0.137507902808, name='Aar@AA_maker7399_1')
        Clade(branch_length=0.0129090235079)
            Clade(branch_length=9.05326020871e-05)
                Clade(branch_length=0.0318934795022, name='Spa@Tp2g18720')
                Clade(branch_length=0.0273465005242, name='Cpl@CP2_g48793_DN3_SP8_c')
            Clade(branch_length=0.00328120860999)
                Clade(branch_length=0.00859075940423)
                    Clade(branch_length=0.0340484449097)
                        Clade(branch_length=0.0332592496158, name='Bst@Bostr_13083s0053_1')
                        Clade(branch_length=0.0150356382287)
                            Clade(branch_length=0.0205924636564)
                                Clade(branch_length=0.0328569260951, name='Aly@AL8G21130_t1')
                                Clade(branch_length=0.0391706378372, name='Ath@AT5G48370_1')
                            Clade(branch_length=0.00998579652059)
                                Clade(branch_length=0.0954469923893, name='Chi@CARHR183840_1')
                                Clade(branch_length=0.0570981548016, name='Cru@Carubv10026342m')
                    Clade(branch_length=0.0372829371381)
                        Clade(branch_length=0.0206478928557)
                            Clade(branch_length=0.0144626717872)
                                Clade(branch_length=0.00823215335663, name='Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
                                Clade(branch_length=0.0085462978729, name='Hlo@DN13684_c0_g1_i1_p1')
                            Clade(branch_length=0.0225079453622, name='Hla@DN22821_c0_g1_i1_p1')
                        Clade(branch_length=0.048590776459, name='Hse@DN23412_c0_g1_i3_p1')
                Clade(branch_length=1.00000050003e-06)
                    Clade(branch_length=0.0378509854703, name='Esa@Thhalv10004228m')
                    Clade(branch_length=0.0712272454125, name='Aal@Aa_G102140_t1')

==== result =====
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;

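Phylo's default format writes fewer digits than the original tree contains. To keep the numbers unchanged, simply override the branch length format string with "%s":

Phylo.write(tree, out, "newick", format_branch_length="%s")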
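
The effect of the format string can be seen with plain Python string formatting. In this sketch, "%1.5f" is an assumption standing in for the default: it is consistent with the five-decimal output shown above, but it is not taken from the answer itself.

```python
# Three branch lengths taken from the tree in the question.
lengths = [0.0726396855636, 9.05326020871e-05, 1.00000050003e-06]

# Fixed five decimals (assumed default): small values collapse towards zero.
fixed = ["%1.5f" % value for value in lengths]

# "%s" uses Python's str() of the float, which keeps all digits.
unchanged = ["%s" % value for value in lengths]

print(fixed)      # ['0.07264', '0.00009', '0.00000']
print(unchanged)  # ['0.0726396855636', '9.05326020871e-05', '1.00000050003e-06']
```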

If you need the parentheses in the output, try this regular expression:

import re
# "@gene_id", or ":branch_length", or the final ";"
regex = r"@[A-Za-z0-9_.]+|:[0-9.e+-]+|;"
phylogenetic_tree = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"

print(re.sub(regex, "", phylogenetic_tree))
Output:

(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))
You can use a regular expression:

import re 
s = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
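p = r"...?(?=@)|\(|\)"

result = re.findall(p, s)
You get the result as a list, so you can join it into a string or do whatever you like with it.

Explaining what is going on: p is the regular expression pattern. In this pattern:
. matches any single character, so ...? matches two or three characters (the third one is optional)
(?=@) is a lookahead: the match must be immediately followed by @, but the @ itself is not part of the match, so ...?(?=@) matches the characters right before an @
| is the "or" operator; I use it here to look for another pattern
the rest, \( and \), looks for the parentheses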
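
Because findall returns only names and parentheses, the commas have to be put back when joining the list into a string. The following is one possible sketch of that step, using a shortened, made-up sample string; the rule for where commas go is my own assumption, not part of the answer above.

```python
import re

# Shortened sample in the same "species@gene_id:length" format (made up for illustration).
s = "(Esy@g1:0.07,Aar@g2:0.13,((Spa@g3:0.03,Cpl@g4:0.02):9.0e-05,Bst@g5:0.03):0.01):0.01;"

tokens = re.findall(r"...?(?=@)|\(|\)", s)
# tokens is a flat list of "(", ")" and species names, without commas

parts = []
for tok in tokens:
    # A comma is needed between siblings: never right after "(" and never before ")".
    if parts and parts[-1] != "(" and tok != ")":
        parts.append(",")
    parts.append(tok)

rebuilt = "".join(parts)
print(rebuilt)  # (Esy,Aar,((Spa,Cpl),Bst))
```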

Parsing code can be hard to understand. By combining a grammar with Python, you can write readable parsing code:

text = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"

import tatsu

# raw string, so the backslashes in the regex rules reach TatSu unchanged
grammar = r"""
start = things ';'
    ;

things = thing [ ',' things ]
    ;

thing = x '@' y ':' number
    | '(' things ')' ':' number
    ;

x = /\w+/
    ;

y = /\w+/
    ;

number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
    ;
"""

class Semantics:
    def x(self, ast):
        # the method name matches the rule name
        print('X =', ast)

parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)

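It looks like you need to come up with a more sophisticated parsing algorithm. I suggest you read up on "tokenizing" and "parsing". My idea would be to parse the whole phylogenetic tree into what we call an abstract syntax tree. You can then walk that tree and pick out the pieces you want.

Have you tried regular expression matching? Simply matching "...@" seems like the logical approach.

I am not at all familiar with the subject of phylogenetics, but I think your answer might be right. If you could clarify in your answer how this gets what the OP wants (just the names and the structure), I think it would be more helpful. Right now your answer looks a bit too noisy.

This is definitely a more comprehensive answer. Although a simpler answer solved my problem, I will certainly look into Biopython for my further needs. Just in case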
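
The tokenize-then-parse idea from the comments could look roughly like the sketch below. It is a minimal recursive-descent parser for the subset of Newick used in the question (no quoted labels, no whitespace, no comments); the function names and the shortened sample string are hypothetical and exist only for this illustration.

```python
import re

PUNCT = {"(", ")", ",", ";"}
TOKEN = re.compile(r"[(),;]|[^(),;]+")


def tokenize(newick):
    """Split the string into punctuation tokens and label tokens."""
    return TOKEN.findall(newick)


def parse_node(tokens, pos):
    """Return (node, next_pos); a node is a species name or a list of child nodes."""
    if tokens[pos] == "(":
        children = []
        pos += 1  # skip "("
        while True:
            child, pos = parse_node(tokens, pos)
            children.append(child)
            if tokens[pos] == ",":
                pos += 1
            elif tokens[pos] == ")":
                pos += 1
                break
            else:
                raise ValueError("unexpected token: " + tokens[pos])
        # An optional label such as ":0.01" may follow ")"; it is discarded here.
        if pos < len(tokens) and tokens[pos] not in PUNCT:
            pos += 1
        return children, pos
    if tokens[pos] in PUNCT:
        raise ValueError("expected a leaf label, got: " + tokens[pos])
    # Leaf: keep only the species part of "species@gene_id:length".
    label = tokens[pos]
    return label.split(":")[0].split("@")[0], pos + 1


def render(node):
    """Turn the nested lists back into a parenthesised string of names."""
    if isinstance(node, list):
        return "(" + ",".join(render(child) for child in node) + ")"
    return node


newick = "(Esy@g1:0.07,Aar@g2:0.13,((Spa@g3:0.03,Cpl@g4:0.02):9.0e-05,Bst@g5:0.03):0.01):0.01;"
tree, _ = parse_node(tokenize(newick), 0)
print(tree)          # ['Esy', 'Aar', [['Spa', 'Cpl'], 'Bst']]
print(render(tree))  # (Esy,Aar,((Spa,Cpl),Bst))
```

For real data a dedicated library such as Biopython is likely the safer choice, since it handles the parts of the format this sketch ignores.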