Counting specific patterns in seq IDs with Python


I have a huge multi-FASTA sequence file, for example:

>Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042  (busco)
AAATGAGATA
>EOGATTT_0042_0042  (busco)
ATGGAAT
>EOGATTA_0042_0042  (busco)
ATAGGAGAT
I want to count how many BUSCO genes there are in my file (their names all start with >EOG), so I wrote this script:

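from Bio import SeqIO

# Count every record in the file (BUSCO or not); start at 0 to avoid an off-by-one
count = 0
for record in SeqIO.parse("concatenate_with_busco_names_0035_0042_aa.fa", "fasta"):
    count += 1
print(count)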

set_of_labels = set()

with open("concatenate_with_busco_names_0035_0042_aa.fa") as f:
  for line in f:
    if line.startswith('>EOG'):
      label = line[4:].split('_')[0]
      set_of_labels.add(label)

print("Total number of Busco genes: " + str(len(set_of_labels)))
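But I would also like to know how many genes I have for each of the corresponding components. Let me explain.

As you can see, each seq ID contains two numbers, in the form _number_number. These numbers are specific: the first _number corresponds to the species the sequence belongs to, and the second _number is a specific number. Anyway, I would like to know whether it is possible to count, as I did above, how many busco genes I get for each first number (_0035 and _0042), and also how many there are for each of these seq ID endings: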

_0035_0042
_0035_0042
_0042_0042
_0042_0035
For the example above, it would be:

Total busco: 5 (I count only once if the >busco is present even if _number are different)
Total busco for the specie _0035 (_0035_0042 and _0035_0035) : 3
Total busco for the specie _0042 (_0042_0042 and _0042_0035) : 4
Total busco for the specific specie  _0035_0042 : 3
Total busco for the specific specie  _0042_0035 : 0
Total busco for the specific specie  _0042_0042 : 4
Total busco for the specific specie  _0035_0035 : 0
I hope this is clear. The first part (Total busco:) is already done by my script; I only need the remaining counts.

Here is the real data.

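In addition to the busco counter, you can use several counters to get the individual counts per species and per specific species, for example: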

import collections

busco = collections.defaultdict(int)  # busco counter
species = collections.defaultdict(int)  # species counter
specific_species = collections.defaultdict(int)  # specific species counter

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco[entry[0]] += 1
            species[entry[1]] += 1
            specific_species[entry[1] + "_" + entry[2]] += 1

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie, total in specific_species.items():
    print("Total busco for the specific specie {}: {}".format(specie, total))
This will produce:

Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0042: 4

Listing every combination, including those with a zero count, this will produce:

Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0035: 0
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0035: 0
Total busco for the specific specie 0042_0042: 4

For your full data it prints:

Total busco: 421
Total busco for the specie 0035: 402
Total busco for the specie 0042: 397
Total busco for the specific specie 0035_0035: 392
Total busco for the specific specie 0035_0042: 262
Total busco for the specific specie 0042_0035: 305
Total busco for the specific specie 0042_0042: 383

This is a good use case for the Counter class from the Python standard library:

from collections import Counter
from io import StringIO

label_counter = Counter()
specy_counter = Counter()
specific_specy_counter = Counter()

# replace this with an open() on your real file 
finput = StringIO(""">Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042  (busco)
AAATGAGATA
>EOGATTT_0042_0042  (busco)
ATGGAAT
>EOGATTA_0042_0042  (busco)
ATAGGAGAT""")



for line in finput:
    try:
        if line.startswith('>EOG'):
            label, specy, specific = line[4:].replace(" (busco)", "").strip().split('_')
            label_counter[label] += 1
            specy_counter[specy] += 1
            specific_specy_counter[(specy, specific)] += 1
    except ValueError:
        print("Invalid line:", line)


print("Total busco:", len(label_counter))
for specy, count in specy_counter.items():
    print("Total busco for the specie {} : {}".format(specy, count))
for (specy, specific), count in specific_specy_counter.items():
    print("Total busco for the specific specy {}_{} : {}".format(specy, specific, count))

Note that species or specific species with a count of 0 are not displayed.
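If the zero counts should be listed as well, one possible approach is to iterate over every expected combination instead of over the counter's own keys. The sketch below is only an illustration: it assumes the number codes are known in advance (the hypothetical `codes` list) and uses the `>EOG` headers from the sample in the question. It relies on the fact that a `Counter` returns 0 for a key it has never seen.

```python
from collections import Counter
from itertools import product

# The '>EOG' header lines from the sample in the question.
headers = [
    ">EOGWX56TR_0035_0042 (busco)",
    ">EOGWX56TR_0042_0042 (busco)",
    ">EOG55FTG_0035_0042 (busco)",
    ">EOG5GFFTA_0042_0042 (busco)",
    ">EOGATTT_0035_0042  (busco)",
    ">EOGATTT_0042_0042  (busco)",
    ">EOGATTA_0042_0042  (busco)",
]

# Assumption: the complete list of number codes is known beforehand.
codes = ["0035", "0042"]

pair_counter = Counter()
for header in headers:
    # ">EOGATTT_0035_0042  (busco)" -> ["ATTT", "0035", "0042"]
    label, species, specific = header.split()[0][4:].split("_")
    pair_counter[(species, specific)] += 1

# Looking up a missing key in a Counter yields 0, so unseen pairs are reported too.
report = {pair: pair_counter[pair] for pair in product(codes, repeat=2)}
for (species, specific), total in report.items():
    print("Total busco for the specific specie {}_{} : {}".format(species, specific, total))
```

For the sample headers this reports 0, 3, 0 and 4 for 0035_0035, 0035_0042, 0042_0035 and 0042_0042, which matches the values expected in the question.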

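How about using two dictionaries that store the number of matches? One keyed by _number_number and one keyed by _number?

Hi, first of all thank you for your help :) But something seems to be wrong, because I get far too many: my data contains only 422 distinct >EOG values, yet I actually get 854. And when I replace if line[:4] == ">EOG" with if line[:4] == ">EOG090X03YJ": to check whether the code works, I get 0, whereas I should get tot busco = 1, 0042_0042 = 1; 0042_0035 = 0; 0035_0042 = 0 and _0035_0035 = 0. I have added my real data if you want to check :)

You did not state (or at least it was not clear to me) that you want the other two counters to depend on their parent counter as well, i.e. to count species/specific_specie only for unique busco entries. With the code above, if there are two >EOG01_02_03 entries, only one busco entry (01) is counted, but the species (02) and the specific species (02_03) are incremented on every encounter, giving a count of 2.

Yes, for the total the number is correct (each busco is counted only once across all species and specific species), but for the per-species totals, if there are for example EOG01 and EOG01 and EOG01, I only want to count one busco. More specifically, I want to count within each case, so here it would be: _02_03 = 1 and _02_04 = 1. Maybe you could take a look at the real data I added to my first post to understand better?

I looked at the data, but it is still not clear to me what you are trying to do. The busco count is unique because its keys are stored in a dict (although you can get the count of all encountered busco entries via sum(busco.values())), but the numbers after it are handled separately. Should they be counted only on unique busco entries? For example, if there are two entries like:
>EOG01_02_03
>EOG01_02_04
should it count one busco (01), one species (02) and two specific species (02_03, 02_04)?

Yes, in your example there are 3 levels of counting. The first level is just one busco entry (all busco). The second level depends on the species (the first number), but if there are EOG01_02_03 and EOG01_02_04 you would only have one busco. For the third level, you count all the distinct busco :) So since there are 4 specific species, there are 6 different values to count.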
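The requirement that emerges from the comments (count each busco label only once per species and once per specific species) can be sketched with sets instead of integer counters. This is a minimal illustration, not the answerers' code: the function name `count_unique_busco` and its return layout are hypothetical, and it assumes every header has the form `>EOG<label>_<number>_<number>` with no underscore inside the label.

```python
from collections import defaultdict


def count_unique_busco(lines):
    """Collect the distinct busco labels overall, per species and per (species, specific) pair."""
    all_labels = set()
    per_species = defaultdict(set)
    per_pair = defaultdict(set)
    for line in lines:
        if not line.startswith(">EOG"):
            continue  # sequence lines and non-busco headers are ignored
        fields = line.split()[0][4:].split("_")
        if len(fields) != 3:
            continue  # header does not match the assumed >EOG<label>_<num>_<num> form
        label, species, specific = fields
        # Adding to a set is idempotent, so repeated entries are counted once.
        all_labels.add(label)
        per_species[species].add(label)
        per_pair[(species, specific)].add(label)
    return all_labels, per_species, per_pair


# The two-entry example from the comments, plus one exact duplicate header.
lines = [">EOG01_02_03", "ATG", ">EOG01_02_04", "ATG", ">EOG01_02_03", "ATG"]
all_labels, per_species, per_pair = count_unique_busco(lines)

print("Total busco:", len(all_labels))
for species, labels in sorted(per_species.items()):
    print("Total busco for the specie {} : {}".format(species, len(labels)))
for (species, specific), labels in sorted(per_pair.items()):
    print("Total busco for the specific specie {}_{} : {}".format(species, specific, len(labels)))
```

On this small input the sketch reports one busco in total, one for species 02, and one each for 02_03 and 02_04, even though 02_03 appears twice. For a real file, the open file object can be passed directly in place of the `lines` list, since the function only iterates over lines.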