Nested dict and values in Python
I have a file like this:
LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1
LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2
LOC_Os06g39870.1 chlo 7, cyto 4, nucl 1, E.R. 1, pero 1
LOC_Os06g48240.1 chlo 9, mito 4
LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2
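I care about "chlo", "chlo_mito", and "mito", and about the sum of the values in each line.
For example, in the line LOC_Os06g07630.1 I will use chlo 2 and chlo_mito 1;
their sum is 3 = (chlo) 2 + (chlo_mito) 1.
The row total is
(cyto) 8 + (chlo) 2 + (extr) 2 + (nucl) 1 + (cysk) 1 + (chlo_mito) 1 + (cysk_nucl) 1 = 16, so I print 3/16.
I want to get the following: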
LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16
LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5
LOC_Os06g39870.1 chlo 7 7/14
LOC_Os06g48240.1 chlo 9 mito 4 13/13
LOC_Os06g48250.1 chlo 4 mito 2 6/13
My code is:
import re
dic={}
b=re.compile("chlo|mito|chlo_mito")
with open("~/A","r") as f1:
    for i in f1:
        if i.startswith("#"):continue
        a=i.replace(',',"").replace(" ","/")
        m=b.search(a)
        if m is not None:
            dic[a.strip().split("/")[0]]={}
            temp=a.strip().split("/")[1:]
            c=range(1,len(temp),2)
            for x in c:
                dic[a.strip().split("/")[0]][temp[x-1]]=temp[x]
#print dic
lis=["chlo","mito","chlo_mito"]
for k in dic:
    sum_value=0
    sum_values=0
    for x in dic[k]:
        sum_value=sum_value+float(dic[k][x])
    for i in lis:
        #sum_values=0
        if i in dic[k]:
            #print i,dic[k][i]
            sum_values=sum_value+float(dic[k][i])
            print k,dic[k],i,sum_values
#print k,dic[k]
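Your description of the problem is not very clear, but here is what I would do: write a function that takes one line of your file as input and returns a dictionary with the keys "chlo", "chlo_mito", "mito", and "total sum". That will make your life much easier. Something like the following code may help. I assume your input file is named f_input.txt: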
from ast import literal_eval

with open("f_input.txt", "r") as f:
    for line in f:
        # Drop the commas and split into [id, name, value, name, value, ...]
        k = line.rstrip().replace(',', '').split()
        if not k:
            continue  # skip blank lines
        chlo = sum(literal_eval(k[j+1]) for j in range(len(k)-1) if k[j] == 'chlo')
        mito = sum(literal_eval(k[j+1]) for j in range(len(k)-1) if k[j] == 'mito')
        chlo_mito = sum(literal_eval(k[j+1]) for j in range(len(k)-1) if k[j] == 'chlo_mito')
        # The values sit at the even indices 2, 4, 6, ...
        total = sum(literal_eval(k[j]) for j in range(2, len(k), 2))
        if mito == 0 and chlo_mito != 0:
            print("{0} chlo {1} chlo_mito {2} {3}/{4}".format(k[0], chlo, chlo_mito, chlo + chlo_mito, total))
        elif mito != 0 and chlo_mito == 0:
            print("{0} chlo {1} mito {2} {3}/{4}".format(k[0], chlo, mito, chlo + mito, total))
        elif mito != 0 and chlo_mito != 0:
            print("{0} chlo {1} mito {2} chlo_mito {3} {4}/{5}".format(k[0], chlo, mito, chlo_mito, chlo + mito + chlo_mito, total))
        else:
            # Neither mito nor chlo_mito is present
            print("{0} chlo {1} {2}/{3}".format(k[0], chlo, chlo, total))
Output:
LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16
LOC_Os06g12160.1 chlo 7 mito 2.5 9.5/14.5
LOC_Os06g39870.1 chlo 7 7/14
LOC_Os06g48240.1 chlo 9 mito 4 13/13
LOC_Os06g48250.1 chlo 4 mito 2 6/13
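The function suggested at the start of this answer (one line in, one dictionary out) could be sketched roughly as below. The name parse_line and the choice of 0.0 for a missing field are assumptions made for illustration, not part of the original answer.

```python
def parse_line(line):
    """Return the chlo / chlo_mito / mito values and the row total for one line."""
    # Hypothetical helper: the name and the 0.0 default are illustrative assumptions.
    fields = line.replace(",", "").split()     # [id, name, value, name, value, ...]
    names = fields[1::2]                       # every second item, starting at the first name
    values = [float(v) for v in fields[2::2]]  # the matching numbers
    pairs = dict(zip(names, values))
    result = {key: pairs.get(key, 0.0) for key in ("chlo", "chlo_mito", "mito")}
    result["total sum"] = sum(values)
    return result


row = parse_line("LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2")
print(row)
```

With a dictionary per line, the printing step only has to look at the three keys and "total sum", so it no longer needs a separate branch for every combination of fields.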
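I'm not sure how much you care about speed, but in genomics it usually matters. Where you can avoid it, you probably shouldn't do much string manipulation, and you should use as few regular expressions as possible. Here is a version that uses no regexes and tries not to spend time building temporary objects. I chose a different output format from the one you gave, because yours is hard to parse again. You can easily change it back by modifying the .format string.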
Test_data = """
LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1
LOC_Os06g12160.1 chlo 7, nucl 3, mito 2.5, cyto_mito 2
LOC_Os06g39870.1 chlo 7, cyto 4, nucl 1, E.R. 1, pero 1
LOC_Os06g48240.1 chlo 9, mito 4
LOC_Os06g48250.1 cyto 5, chlo 4, mito 2, pero 2
"""

def open_input():
    """
    Return a file-like object as input stream. In this case,
    it is a StringIO based on your test data. If you have a file
    name, use that instead.
    """
    if False:
        return open('inputfile.txt', 'r')
    else:
        import io
        return io.StringIO(Test_data)

SUM_FIELDS = set("chlo mito chlo_mito".split())

with open_input() as infile:
    for line in infile:
        line = line.strip()
        if not line: continue
        cols = line.split(maxsplit=1)
        if len(cols) != 2: continue
        test_id,remainder = cols
        out_fields = []
        fld_sum = tot_sum = 0.0
        for pair in remainder.split(', '):
            k,v = pair.rsplit(maxsplit=1)
            vf = float(v)
            tot_sum += vf
            if k in SUM_FIELDS:
                fld_sum += vf
                out_fields.append(pair)
        print("{0} {2}/{3} ({4:.0%}) {1}".format(test_id, ', '.join(out_fields), fld_sum, tot_sum, fld_sum/tot_sum))
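As a rough illustration of changing the .format string back to the layout asked for in the question, the sketch below formats one row. The helper name format_row is hypothetical, and using the `g` presentation type to print 3.0 as 3 is an assumption about how the sums should look.

```python
def format_row(test_id, out_fields, fld_sum, tot_sum):
    # out_fields holds strings such as "chlo 2", as collected by the loop above.
    # "{:g}" drops a trailing ".0", so 3.0 prints as 3 while 9.5 stays 9.5.
    return "{0} {1} {2:g}/{3:g}".format(test_id, " ".join(out_fields), fld_sum, tot_sum)


print(format_row("LOC_Os06g07630.1", ["chlo 2", "chlo_mito 1"], 3.0, 16.0))
# LOC_Os06g07630.1 chlo 2 chlo_mito 1 3/16
```

Inside the loop, the final print call would then become print(format_row(test_id, out_fields, fld_sum, tot_sum)).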
But every line also has other fields, such as "nucl", and the number of them differs from line to line.
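One way to cope with a varying number of fields is to store every name/value pair of a line in an inner dictionary, so the row total is the sum of whatever that line happens to contain. The sketch below assumes the input format shown in the question; load_table, SAMPLE, and WANTED are hypothetical names used only for illustration.

```python
import io

# Two lines from the question, with different numbers of fields.
SAMPLE = """\
LOC_Os06g07630.1 cyto 8, chlo 2, extr 2, nucl 1, cysk 1, chlo_mito 1, cysk_nucl 1
LOC_Os06g48240.1 chlo 9, mito 4
"""

WANTED = ("chlo", "mito", "chlo_mito")


def load_table(stream):
    """Build {gene_id: {field_name: value}} from an iterable of lines."""
    table = {}
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        gene_id, rest = line.split(maxsplit=1)
        table[gene_id] = {}
        for pair in rest.split(","):
            name, value = pair.split()  # e.g. " chlo 2" -> ["chlo", "2"]
            table[gene_id][name] = float(value)
    return table


table = load_table(io.StringIO(SAMPLE))
for gene_id, scores in table.items():
    wanted_sum = sum(v for k, v in scores.items() if k in WANTED)
    total = sum(scores.values())  # covers every field on the line, however many
    print(gene_id, "{:g}/{:g}".format(wanted_sum, total))
```

Because the inner dictionary simply grows with the line, fields such as "nucl" or "pero" count toward the total without being named anywhere in the code.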