Processing a large CSV file: looking up values from another CSV (Python)
I am parsing CSV files. I have one CSV file that looks like this:
SeqID | GIs
SeqA    123 456
SeqB    999 888 777
I also have a second file that serves as a cross-reference. It looks like this:
GI  | XIDs
123   X781
456   X676
789   X123
999   X217
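The aim is to look up the GIs of each Seq in the file that serves as the cross-reference. The problem is that this cross-reference file is very large: 2.3 GB. So far I have tried to solve it like this: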
import csv
from collections import defaultdict

def map_GI(gilist, mapping_file):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
    XID_list = defaultdict(list)  # set up XID list as empty dictionary of lists
    with open(mapping_file) as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        reference_mapping = list(read_mapping)  # read the whole reference into a list
    for k, v in GI_list.items():  # iterate over GI list and mapping file
        for row in reference_mapping:  # full scan of the reference for every SeqID
            if row[0] in v:
                XID_list[k].append(row[1])  # write found XIDs into dictionary
    with open("/output.txt", 'w', newline='') as outfile:  # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in XID_list.items():
            looked_up_go.writerow([key] + val)
The desired output is a file listing the original SeqIDs and the corresponding XIDs:
SeqID | XIDs
SeqA    X781 X676
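The code works, but it takes forever, and then some. I know that reading the cross-reference into a list is not very clever.
I found some related questions, but none of them was quite what I was looking for.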
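If you have one small file and one large file, the usual answer is to iterate slowly over the large file exactly once and iterate repeatedly over the small one (by reading it into memory if possible, or by re-reading the file if not), rather than the other way around. So, start with something like this: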
import csv

def map_GI(gilist, mapping_file):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
    with open(mapping_file) as mapping:  # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        for gid, xids in read_mapping:  # single pass over the large file
            for gi_seqid, gi_gids in GI_list.items():
                # replace every GI that matches this row with its XID
                GI_list[gi_seqid] = [xids if gi_gid == gid else gi_gid for gi_gid in gi_gids]
    with open("/output.txt", 'w', newline='') as outfile:  # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in GI_list.items():
            looked_up_go.writerow([key] + val)
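In this case, though, if the small file is small enough, you can do even better: build a reverse mapping, so that you can look up the rows that need to be modified instead of iterating over the whole list: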
with open(gilist) as infile:
    read_gi = csv.reader(infile)
    GI_list = {rows[0]: rows[1:] for rows in read_gi}  # read GI list into dictionary
revdict = defaultdict(list)  # reverse mapping: GI -> SeqIDs that list it
for seqid, gids in GI_list.items():
    for gid in gids:
        revdict[gid].append(seqid)
with open(mapping_file) as mapping:  # that's the cross reference
    read_mapping = csv.reader(mapping, delimiter='\t')
    for gid, xids in read_mapping:
        # .get() avoids adding an empty entry for every GI in the large file
        for seqid in revdict.get(gid, ()):
            GI_list[seqid] = [(xids if gi_gid == gid else gi_gid)
                              for gi_gid in GI_list[seqid]]
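For reference, here is a minimal end-to-end sketch of the reverse-mapping idea that produces the SeqID-to-XIDs output the question asks for. The function names `map_gi_to_xids` and `map_files` are made up for illustration, and the sketch assumes both files are tab-separated, which the question does not actually confirm for the GI list.

```python
import csv
from collections import defaultdict


def map_gi_to_xids(gi_rows, mapping_rows):
    """Return {seqid: [xid, ...]} using a reverse GI -> SeqID index."""
    # Reverse index built from the small file: GI -> SeqIDs that list it.
    revdict = defaultdict(list)
    for row in gi_rows:
        if not row:
            continue  # skip blank lines
        seqid, gids = row[0], row[1:]
        for gid in gids:
            revdict[gid].append(seqid)

    xids_by_seq = defaultdict(list)
    # Single pass over the large file; each row costs one dict lookup.
    for row in mapping_rows:
        if len(row) < 2:
            continue  # skip malformed rows
        gid, xid = row[0], row[1]
        for seqid in revdict.get(gid, ()):
            xids_by_seq[seqid].append(xid)
    return dict(xids_by_seq)


def map_files(gilist_path, mapping_path, out_path):
    # Assumption: both input files are tab-separated.
    with open(gilist_path, newline='') as gi_file, \
         open(mapping_path, newline='') as mapping_file:
        result = map_gi_to_xids(
            csv.reader(gi_file, delimiter='\t'),
            csv.reader(mapping_file, delimiter='\t'),
        )
    with open(out_path, 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for seqid, xids in result.items():
            writer.writerow([seqid] + xids)
```

Note that the XIDs for each SeqID come out in the order their GIs appear in the cross-reference file, not in the order of the GI list, and that GIs with no entry in the cross-reference are simply dropped.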
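In fact, even if the small file is not small enough to fit in memory, you can use the same strategy by using a dbm instead of a dict for revdict.
Obviously. I was sitting in the office and just couldn't see it. Thanks a lot.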
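The dbm variant mentioned in the answer could look roughly like the sketch below. It is only one possible layout, not the answerer's code: the helper names `build_reverse_index` and `lookup_xids` are hypothetical, and because dbm stores bytes it joins the SeqIDs for each GI with commas, which assumes that no SeqID contains a comma.

```python
import dbm


def build_reverse_index(gi_rows, db_path):
    """Store GI -> comma-joined SeqIDs in an on-disk dbm database."""
    with dbm.open(db_path, 'c') as db:  # 'c' creates the database if missing
        for row in gi_rows:
            if not row:
                continue  # skip blank lines
            seqid, gids = row[0], row[1:]
            for gid in gids:
                key = gid.encode()
                if key in db:
                    # Assumption: SeqIDs never contain a comma.
                    db[key] = db[key] + b',' + seqid.encode()
                else:
                    db[key] = seqid.encode()


def lookup_xids(mapping_rows, db_path):
    """Stream the large file once, looking each GI up in the dbm index."""
    xids_by_seq = {}
    with dbm.open(db_path, 'r') as db:
        for row in mapping_rows:
            if len(row) < 2:
                continue  # skip malformed rows
            gid, xid = row[0], row[1]
            value = db.get(gid.encode())
            if value is None:
                continue  # this GI is not in the GI list
            for seqid in value.decode().split(','):
                xids_by_seq.setdefault(seqid, []).append(xid)
    return xids_by_seq
```

Only the reverse index lives on disk here. The result dictionary is still held in memory, so if that also grows too large, the rows would have to be written out as they are found instead.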