Only some of my species are converted to NCBI IDs when using Biopython to convert species names to IDs

Tags: python, bioinformatics, biopython, ncbi

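I have some code that takes a species name from a list of underscore-separated names, converts it into a format suitable for NCBI, and then searches for the ID associated with that species name. For some reason, this does not work for every entry in my input file. I have attached my code, a subset of my input file, and a subset of my output file.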

from Bio import Entrez
import time

# NCBI asks every E-utilities client to identify itself with an email address.
Entrez.email = 'fake.email@isp.com'

def get_tax_id(species):
    """Return the list of NCBI Taxonomy IDs matching an underscore-separated name."""
    species = species.replace('_', '+').strip()
    search = Entrez.esearch(term=species, db='taxonomy', retmode='xml')
    record = Entrez.read(search)
    search.close()
    return record['IdList']

if __name__ == '__main__':
    current_time = time.strftime("%d.%m.%y %H:%M", time.localtime())
    output_name = 'test#%s.txt' % current_time

    # The species name is the first tab-separated column of each input line.
    with open("OGTlist.csv") as infile:
        organisms = [line.split('\t')[0].strip() for line in infile]

    with open(output_name, "w") as outfile:
        for organism in organisms:
            taxid = get_tax_id(organism)
            if not taxid:
                # An empty IdList means NCBI Taxonomy has no entry under this name.
                outfile.write('\n' + organism + ',ERROR_no_ID_match')
            else:
                outfile.write('\n' + organism + ',' + ','.join(taxid))
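A subset of my output file is shown first (species name, then NCBI ID or an error marker), followed by a subset of the tab-separated file I take the species names from: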

micromonospora_inyonensis,47866
viola_arvensis,97415
amycolatopsis_albidoflavus,102226
tetragenococcus_koreensis,290335
panaeolus_papilionaceus,330517
geomys_pinetis,100306
vibrio_lutjanus,ERROR_no_ID_match
succiniclasticum_ruminis,40841
microtetraspora_malaysiensis,161358
blarina_carolinensis,183658
amycolatopsis_palatopharyngis,187982
rhodosporidium_toruloides,5286
geobacter_bemidjiensis,225194
acinetobacter_haemolyticus,29430
actinoplanes_tereljensis,571912
phyllostomus_hastatus,9423
phacidium_infestans,66518
dorea_formicigenerans,39486
hoeflea_marina,274592
naemacyclus_minor,64355
methanosaeta_thermophila,2224
pholiota_carbonaria,227966
sphingomonas_faeni,185950
helicobacter_pullorum,35818
solitalea_koreensis,543615
dermacoccus_profundi,322602
pseudomonas_pictorum,86184
actinomadura_livida,79909
leptonycteris_curasoae,55054
psychrobacter_salsus,219741
vibrio_inusitatus,413402
stereum_rameale,ERROR_no_ID_match
photorhabdus_temperata,574560
clitocybe_lignatilis,5634
actinocorallia_glomerata,46203
aspergillus_giganteus,5060
erwinia_amylovora,552
hydrogenoanaerobacterium_saccharovorans,474960
mycobacterium_aichiense,1799
nocardia_pneumoniae,228601
bacillus_pocheonensis,363869
streptomonospora_alba,183763
exobasidium_gracile,190086
phenylobacterium_zucineum,284016
amsonia_tabernaemontana,144544
rattus_fuscipes,10119
jannaschia_rubra,282197
hereroa_rehneltiana,ERROR_no_ID_match
micromonospora_inyonensis   28  DSMZ
viola_arvensis  23  DSMZ
amycolatopsis_albidoflavus  28  DSMZ
tetragenococcus_koreensis   28  DSMZ
panaeolus_papilionaceus 24  DSMZ
geomys_pinetis  36.3    white
vibrio_lutjanus 30  DSMZ
succiniclasticum_ruminis    37  DSMZ
microtetraspora_malaysiensis    28  DSMZ
blarina_carolinensis    36.8    white
amycolatopsis_palatopharyngis   28  DSMZ
rhodosporidium_toruloides   23  DSMZ
geobacter_bemidjiensis  30  DSMZ
acinetobacter_haemolyticus  28  DSMZ
actinoplanes_tereljensis    28  DSMZ
phyllostomus_hastatus   34.7    white
phacidium_infestans 25  DSMZ
dorea_formicigenerans   37  DSMZ
hoeflea_marina  28  DSMZ
naemacyclus_minor   22  DSMZ
methanosaeta_thermophila    58.3333333333   DSMZ
pholiota_carbonaria 25  DSMZ
sphingomonas_faeni  22  DSMZ
helicobacter_pullorum   37  DSMZ
solitalea_koreensis 28  DSMZ
dermacoccus_profundi    28  DSMZ
pseudomonas_pictorum    28  DSMZ
actinomadura_livida 28  DSMZ
leptonycteris_curasoae  35.7    white
psychrobacter_salsus    22  DSMZ
vibrio_inusitatus   28  DSMZ
stereum_rameale 20  DSMZ
photorhabdus_temperata  28.6666666667   DSMZ
clitocybe_lignatilis    25  DSMZ
actinocorallia_glomerata    28  DSMZ
aspergillus_giganteus   24.5    DSMZ
erwinia_amylovora   26.6666666667   DSMZ
hydrogenoanaerobacterium_saccharovorans 37  DSMZ
mycobacterium_aichiense 37  DSMZ
nocardia_pneumoniae 28  DSMZ
bacillus_pocheonensis   30  DSMZ
streptomonospora_alba   28  DSMZ
exobasidium_gracile 20  DSMZ
phenylobacterium_zucineum   30  DSMZ
amsonia_tabernaemontana 23  DSMZ
rattus_fuscipes 37.5    white
jannaschia_rubra    25  DSMZ
hereroa_rehneltiana 23  DSMZ

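My actual input file has about 2000 entries. Is the answer as simple as the species names being incorrect, or NCBI not having an ID for every species? Can anyone suggest a way to work around this programmatically?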

The first answer is that the species name does not exist in NCBI. You can check this on the NCBI website, like here:

If you look at other sites, Vibrio lutjanus does not seem to exist there either. For example or

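There is no solution that overcomes this (as far as looking up the taxon ID goes), but you can double-check that the names are correct. Taxonomy is difficult: everyone uses different names, and there are many synonyms. You can use the API of a taxonomic names website, such as GBIF or Global Names.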
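As a rough sketch of such a name check, the snippet below builds a request for GBIF's species match endpoint and reduces the JSON reply to a few fields. The function names are made up for illustration. The endpoint URL and the response keys (`matchType`, `canonicalName`, `status`, `synonym`) are my understanding of the GBIF API, so please verify them against the GBIF documentation before relying on this.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the GBIF name-matching service.
GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_match_url(raw_name):
    """Turn 'vibrio_lutjanus' into a GBIF species-match request URL."""
    name = raw_name.strip().replace("_", " ").capitalize()
    return GBIF_MATCH_URL + "?" + urllib.parse.urlencode({"name": name})


def summarize_match(payload):
    """Reduce a GBIF match response (a dict) to the fields of interest here."""
    return {
        # Assumed values include EXACT, FUZZY, HIGHERRANK and NONE.
        "match_type": payload.get("matchType", "NONE"),
        "matched_name": payload.get("canonicalName"),
        "status": payload.get("status"),
        "is_synonym": payload.get("synonym", False),
    }


def check_name(raw_name, timeout=10):
    """Query GBIF for one name (requires network access)."""
    url = build_match_url(raw_name)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return summarize_match(json.load(response))
```

Calling `check_name("vibrio_lutjanus")` for every name that came back as `ERROR_no_ID_match` should help separate misspellings and synonyms from names that are simply unknown. With about 2000 entries, add a short pause between requests.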

[Edit]

If the species is not available, you could also look up the taxon ID of the genus. You can download NCBI's taxonomy information here:


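You need to download the zip file, and you will probably want the rankedlineage.dmp and merged.dmp files. The Global Names website can also be used at genus level. I do not know whether Entrez from BioPython can look up IDs at genus level; that may be an option too.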
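A minimal sketch of the genus fallback using the downloaded dump files could look like this. It assumes the usual taxdump layout (fields separated by tab, pipe, tab, with each line ending in tab, pipe), that the first two columns of rankedlineage.dmp are the tax ID and the scientific name, and that merged.dmp holds old ID and new ID. The function names are hypothetical.

```python
def parse_dmp_line(line):
    # Assumed layout: fields separated by tab-pipe-tab, line ends with tab-pipe.
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return [field.strip() for field in line.split("\t|\t")]


def load_name_index(lines):
    """Map lower-cased scientific name -> tax_id from rankedlineage.dmp lines."""
    index = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            # Keep the first ID seen if the same name occurs more than once.
            index.setdefault(fields[1].lower(), fields[0])
    return index


def load_merged(lines):
    """Map retired tax_id -> current tax_id from merged.dmp lines."""
    merged = {}
    for line in lines:
        fields = parse_dmp_line(line)
        if len(fields) >= 2:
            merged[fields[0]] = fields[1]
    return merged


def find_tax_id(raw_name, index):
    """Return (tax_id, level); level is 'species', 'genus' or None."""
    name = raw_name.strip().replace("_", " ").lower()
    if name in index:
        return index[name], "species"
    genus = name.split(" ")[0]
    if genus in index:
        return index[genus], "genus"
    return None, None
```

To use it, open the extracted file and pass the handle straight in, for example `with open("rankedlineage.dmp") as handle: index = load_name_index(handle)`, then call `find_tax_id("vibrio_lutjanus", index)`. The same name can exist in more than one kingdom, so a genus-level hit is worth a manual check.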

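I changed my answer and added a few things. The marine website was not a very good example, so I changed it to SILVA. I also added some other options for you.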