Processing a large CSV file, looking up values from another CSV in Python


I am analysing CSV files. I have one CSV file that looks like this:

SeqID | GIs
SeqA    123 456
SeqB    999 888 777

What I want to do now is use a second file, which serves as a cross-reference and looks like this:

GI | XIDs
123   X781
456   X676
789   X123
999   X217

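The aim is to look up the GIs of each Seq in the file that serves as the cross-reference. The problem is that this cross-reference file is very large: 2.3 GB. So far I have tried to solve it as follows: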

import csv
from collections import defaultdict

def map_GI(gilist, mapped):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]:rows[1:] for rows in read_gi} # read GI list into dictionary
    XID_list = defaultdict(list) # set up XID list as empty dictionary of lists
    with open(mapping_file) as mapping: # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        reference_mapping = list(read_mapping) # load the whole reference into a list
        for k, v in GI_list.items(): # iterate over GI list and mapping file
            for row in reference_mapping:
                if row[0] in v:
                    XID_list[k].append(row[1]) # write found XIDs into dictionary
    with open("/output.txt", 'wb') as outfile: # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in XID_list.iteritems():
            looked_up_go.writerow([key] + val)

The desired output is a file that lists the original SeqIDs and the corresponding XIDs:

SeqID | XIDs
SeqA    X781 X676

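The code works, but it takes forever and then some. I know that loading the cross-reference into a list is not very smart. I found some related questions, but still nothing that is quite what I am looking for.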


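If you have one small file and one big file, the usual answer is to iterate over the big file slowly, exactly once, while iterating over the small file repeatedly (by reading it into memory if possible, or by re-reading the file if not), rather than the other way around.

So, start with something like this: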

import csv

def map_GI(gilist, mapped):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]:rows[1:] for rows in read_gi} # read GI list into dictionary
    with open(mapping_file) as mapping: # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        for gid, xids in read_mapping: # single pass over the big file
            for gi_seqid, gi_gids in GI_list.items():
                # swap each matching GI for its XID, leave the others untouched
                GI_list[gi_seqid] = [xids if gi_gid == gid else gi_gid for gi_gid in gi_gids]
    with open("/output.txt", 'wb') as outfile: # save mapped SeqIDs plus XIDs
        looked_up_go = csv.writer(outfile, delimiter='\t')
        for key, val in GI_list.iteritems():
            looked_up_go.writerow([key] + val)
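
However, in this case, if the small file is small enough, you can do even better: build a reverse mapping, so you can look up the rows that need to be modified instead of iterating over the whole list: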

    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]:rows[1:] for rows in read_gi} # read GI list into dictionary
        revdict = defaultdict(list) # reverse mapping: GI -> SeqIDs that contain it
        for seqid, gids in GI_list.iteritems():
            for gid in gids:
                revdict[gid].append(seqid)
    with open(mapping_file) as mapping: # that's the cross reference
        read_mapping = csv.reader(mapping, delimiter='\t')
        for gid, xids in read_mapping:
            # .get() avoids adding an empty entry to revdict for every unmatched GI
            for seqid in revdict.get(gid, ()):
                GI_list[seqid] = [(xids if gi_gid == gid else gi_gid)
                                  for gi_gid in GI_list[seqid]]
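
For reference, here is a minimal, self-contained sketch of the reverse-mapping idea, written for Python 3 (the snippets above use Python 2 idioms such as iteritems and 'wb'). It is only one way to do it and makes several assumptions: the function name map_gi and its three path parameters are made up for illustration, the GI file is assumed to be comma-separated, and the cross-reference file is assumed to be tab-separated with the GI in the first column and the XID in the second. Unlike the snippet above, it collects the XIDs per SeqID, as in the desired output from the question, instead of replacing GIs in place.

```python
import csv
from collections import defaultdict


def map_gi(gi_path, mapping_path, out_path):
    """Hypothetical helper: write SeqID -> XIDs using a reverse index."""
    # Small file: SeqID followed by its GIs (comma-separated is an assumption).
    with open(gi_path, newline='') as infile:
        gi_list = {row[0]: row[1:] for row in csv.reader(infile) if row}

    # Reverse index: GI -> every SeqID that lists it.
    revdict = defaultdict(list)
    for seqid, gids in gi_list.items():
        for gid in gids:
            revdict[gid].append(seqid)

    # Big file: stream it once and touch only the SeqIDs that list this GI.
    xid_list = defaultdict(list)
    with open(mapping_path, newline='') as mapping:
        for row in csv.reader(mapping, delimiter='\t'):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            gid, xid = row[0], row[1]
            # .get() keeps revdict from growing on every unmatched GI
            for seqid in revdict.get(gid, ()):
                xid_list[seqid].append(xid)

    # Write one row per SeqID that had at least one match.
    with open(out_path, 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for seqid in gi_list:
            if seqid in xid_list:
                writer.writerow([seqid] + xid_list[seqid])
    return dict(xid_list)
```

The big file is never held in memory; only the small file and its reverse index are. Note that the XIDs of each SeqID come out in the order they appear in the cross-reference file, which is not necessarily the order of the GIs in the small file.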

In fact, even if the small file isn't small enough to fit in memory, you can use the same strategy with a dbm instead of a dict for revdict.
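
A rough sketch of what the dbm variant could look like with Python 3's standard dbm module is below. The helper names (build_revdb, lookup_seqids, iter_matches), the database path, and the file layouts are all assumptions made for illustration. Because dbm stores flat byte strings rather than lists, the SeqIDs for one GI are joined with a separator that is assumed never to occur inside a SeqID.

```python
import csv
import dbm

SEP = b'\t'  # assumed never to occur inside a SeqID


def build_revdb(gi_path, db_path):
    """Write a GI -> SeqIDs reverse index to an on-disk dbm database."""
    # Flag 'n' always starts from a new, empty database.
    with open(gi_path, newline='') as infile, dbm.open(db_path, 'n') as db:
        for row in csv.reader(infile):
            if not row:
                continue
            seqid = row[0].encode('utf-8')
            for gid in row[1:]:
                key = gid.encode('utf-8')
                # Append to the joined value if this GI was already seen.
                db[key] = (db[key] + SEP + seqid) if key in db else seqid


def lookup_seqids(db, gid):
    """Return the SeqIDs that list this GI, or an empty list."""
    key = gid.encode('utf-8')
    if key not in db:
        return []
    return db[key].decode('utf-8').split(SEP.decode('utf-8'))


def iter_matches(mapping_path, db_path):
    """Stream the big file once, yielding (SeqID, XID) pairs."""
    with dbm.open(db_path, 'r') as db, \
            open(mapping_path, newline='') as mapping:
        for row in csv.reader(mapping, delimiter='\t'):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            for seqid in lookup_seqids(db, row[0]):
                yield seqid, row[1]
```

Yielding pairs rather than building a result dict means nothing large accumulates in memory; the pairs can be written straight to a file and grouped by SeqID afterwards. Each lookup now goes to disk, so expect this to be slower than the in-memory dict.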

So obvious. I was sitting in the office and just could not see it. Thanks a lot.