Python script to assign variations to genes using a coordinate (gene) file

I have a variation table (variation.txt), which is a very large file. The first column is the chromosome number and the second column is the position of the variation. I also have a second file, annotation.txt, which lists 37,000 genes (1st column), their chromosome number (2nd column), their start and end coordinates (3rd and 4th columns), and some details.

I have to assign variations to genes based on chromosome number and position. First, the chromosome number must match between the two files; if it does, the coordinate of the variation must lie within the gene's start and end positions (inclusive). I tried doing this in Python, but it takes a very long time. I would also like the output modified as shown below. Genes can have overlapping coordinates, and a given variation can fall within several overlapping genes. Please help.

variation.txt:
SL3.0ch02 702679 C A - - - - - - - -
SL3.0ch01 711131 A G - - - - - - - -
SL3.0ch00 715124 G A - - - - - - - -
SL3.0ch00 719289 C T - - - - - - - -
SL3.0ch00 720926 A C - - - - - - - -
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED
annotation.txt:
Solyc00g005000.3.1 SL3.0ch02 702600 702900 + Eukaryotic aspartyl protease family protein
Solyc00g005040.3.1 SL3.0ch01 715100 715200 + Potassium channel
Solyc00g005050.3.1 SL3.0ch00 715150 715300 - UPF0664 stress-induced protein C29B12.11c
Solyc00g005060.1.1 SL3.0ch00 723741 724013 - LOW QUALITY:Cyclin/Brf1-like TBP-binding protein
Solyc00g005080.2.1 SL3.0ch00 723800 723900 - LOW QUALITY:Protein Ycf2
Solyc00g005084.1.1 SL3.0ch05 809593 813633 + UDP-Glycosyltransferase superfamily protein
Solyc00g005090.1.1 SL3.0ch07 1061632 1061916 - LOW QUALITY:DYNAMIN-like 1B
Solyc00g005092.1.1 SL3.0ch01 1127794 1144385 + Serine/threonine phosphatase-like protein
Solyc00g005094.1.1 SL3.0ch00 1144958 1146952 - Glucose-6-phosphate 1-dehydrogenase 3, chloroplastic
Solyc00g005096.1.1 SL3.0ch00 1734562 1736567 + RWP-RK domain-containing protein
Desired output:
SL3.0ch02 702679 C A - - - - - - - - Solyc00g005000.3.1
SL3.0ch00 715124 G A - - - - - - - - Solyc00g005040.3.1
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence) Solyc00g005060.1.1
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence) Solyc00g005080.2.1
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED Solyc00g005060.1.1
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED Solyc00g005080.2.1
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED Solyc00g005060.1.1
Code:
import re
file1 = open("variation", "r")
file2 = open("annotation.txt", "r")
probe_id = file1.read().splitlines()
loc_id = file2.read().splitlines()
for i in probe_id:
    i=i.rstrip()
    probe_info=i.split('\t')
    probe_info[1]=probe_info[1].strip()
    probe_info[0]=probe_info[0].strip()
    #print probe_info[1]
    gene_list=[]
    for j in loc_id:
        loc_info=j.split('\t')
        loc_info[2]=loc_info[2].strip()
        loc_info[3]=loc_info[3].strip()
        if loc_info[1]==probe_info[0]:
            if (int(probe_info[1]) >= int(loc_info[2])):
                if (int(probe_info[1]) <=int(loc_info[3])):
                    gene_list.append(loc_info[0])
    if len(gene_list)!=0:
        print i+"\t"+str(gene_list)

Output of this code:
SL3.0ch02 702679 C A - - - - - - - - ['Solyc00g005000.3.1']
SL3.0ch00 715124 G A - - - - - - - - ['Solyc00g005040.3.1']
SL3.0ch00 723860 A C Solyc00g005060.1 CDS NONSYNONYMOUS W/G 52 0 novel DELETERIOUS (*WARNING! Low confidence) ['Solyc00g005060.1.1', 'Solyc00g005080.2.1']
SL3.0ch00 723867 A C Solyc00g005060.1 CDS SYNONYMOUS G/G 49 1 novel TOLERATED ['Solyc00g005060.1.1', 'Solyc00g005080.2.1']
SL3.0ch00 723903 T C Solyc00g005060.1 CDS SYNONYMOUS G/G 37 1 novel TOLERATED ['Solyc00g005060.1.1']
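Here is a start in GNU awk. It matches on the chromosome number and checks that the position falls within a gene's range:
$ awk '
NR==FNR {
    a[$2][$3 " " $4]=$0                      # store the annotations, keyed by chromosome and "start end"
    next
}
($1 in a) {                                  # if chromosome found
    for(i in a[$1])                          # process all the ranges
        if(split(i,t)&&$2>=t[1]&&$2<=t[2])   # if there is a match
            print                            # output the variation line
}' anno vari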
Preprocessing annotation.txt and building a dictionary up front, so that less work is done inside the loop, should be much more efficient.
Please try the following:
#!/usr/bin/python
import re
file1 = open("variation.txt", "r")
file2 = open("annotation.txt", "r")
probe_id = file1.read().splitlines()
loc_id = file2.read().splitlines()
# Parse the annotation once: chromosome -> list of [start, end, gene]
annotation = {}
for i in loc_id:
    loc_info = i.split('\t')
    gene = loc_info[0].strip()
    chromosome = loc_info[1].strip()
    start = int(loc_info[2].strip())
    end = int(loc_info[3].strip())
    if chromosome in annotation:
        annotation[chromosome].append([start, end, gene])
    else:
        annotation[chromosome] = [[start, end, gene]]
# For each variation, check only the genes on the same chromosome
for i in probe_id:
    i = i.rstrip()
    probe_info = i.split('\t')
    position = int(probe_info[1].strip())
    chromosome = probe_info[0].strip()
    if chromosome in annotation:
        for j in annotation[chromosome]:
            if j[0] <= position <= j[1]:
                # one output line per overlapping gene
                print(i + '\t' + j[2])
I think the algorithm is essentially the same as in @James Brown's answer. Hope this helps.
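As a possible refinement of the dictionary approach, here is a minimal sketch that keeps each chromosome's genes sorted by start coordinate and uses the standard `bisect` module to skip genes that start after the variation. The names `build_index` and `genes_at` are made up for illustration, and the sketch assumes tab-separated lines in the column order shown in the question. It still checks every gene that starts at or before the position, so it saves a constant factor at best and is not a substitute for a real interval tree.

```python
import bisect
from collections import defaultdict


def build_index(annotation_lines):
    """Map chromosome -> (sorted start coordinates, genes sorted by start)."""
    by_chrom = defaultdict(list)
    for line in annotation_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        gene, chrom = fields[0].strip(), fields[1].strip()
        by_chrom[chrom].append((int(fields[2]), int(fields[3]), gene))
    index = {}
    for chrom, genes in by_chrom.items():
        genes.sort()  # tuples sort by start first
        index[chrom] = ([g[0] for g in genes], genes)
    return index


def genes_at(index, chrom, position):
    """Return IDs of genes on chrom whose [start, end] contains position."""
    if chrom not in index:
        return []
    starts, genes = index[chrom]
    # Every gene from offset hi onwards starts after the position.
    hi = bisect.bisect_right(starts, position)
    return [gene for start, end, gene in genes[:hi] if position <= end]
```

With the real files, `build_index(open("annotation.txt"))` would be called once, and `genes_at` once per variation line.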
Reading the entire big file into memory just so you can loop over it one line at a time is of course an antipattern, and should be easy to fix here. Similarly, you are looping over loc_id and parsing each line into a structure, then throwing it away and doing the same work again on the next iteration.
Is the second record in the desired output a mistake (SL3.0ch00 715124…)?
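The first comment can be sketched roughly as follows: parse the annotation once, then iterate over the variation file object directly so that only one line is held in memory at a time. `load_genes` and `annotate` are hypothetical names chosen for this sketch, and the file names and tab-separated layout are assumed from the question. One output line is produced per overlapping gene, as in the desired output.

```python
from collections import defaultdict


def load_genes(lines):
    """Parse annotation lines once into chromosome -> [(start, end, gene)]."""
    genes = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom = fields[1].strip()
        genes[chrom].append((int(fields[2]), int(fields[3]), fields[0].strip()))
    return genes


def annotate(variant_lines, genes):
    """Yield each variation line once per gene that contains its position."""
    for line in variant_lines:  # a file object yields one line at a time
        line = line.rstrip("\n")
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        chrom, position = fields[0].strip(), int(fields[1])
        for start, end, gene in genes.get(chrom, ()):
            if start <= position <= end:
                yield line + "\t" + gene


# Possible usage with the files from the question:
# with open("annotation.txt") as anno, open("variation.txt") as variants:
#     for row in annotate(variants, load_genes(anno)):
#         print(row)
```

Because `annotate` is a generator, results are written as they are found and the variation file is never loaded whole; only the much smaller gene table stays in memory.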