Python: how to split a single CSV file into multiple smaller files grouped by a field, and drop columns from the final CSVs

Even if this sounds like a duplicate question, I have not found a solution. I have a large .csv file that looks like:

prot_hit_num,prot_acc,prot_desc,pep_res_before,pep_seq,pep_res_after,ident,country
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPV,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPVL,D,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],L,SSISGAGGGGLA,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],D,NYDNSAGKW,W,F40,EB
....
The goal is to split this .csv file into multiple smaller .csv files based on the last two columns ('ident' and 'country').

I used the code from a previous answer, shown below:

csv_contents = []
with open(outfile_path4, 'rb') as fin:
  dict_reader = csv.DictReader(fin)   # default delimiter is comma
  fieldnames = dict_reader.fieldnames # save for writing
  for line in dict_reader:            # read in all of your data
    csv_contents.append(line)         # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value 
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('prot_desc','ident','country'))


for groupkey, groupdata in it.groupby(sorted_csv_contents, 
                                  key=op.itemgetter('prot_desc','ident','country')):

  with open(outfile_path5+'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
    dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)    
    dict_writer.writerows(groupdata)
However, I need my output .csv to contain only the 'pep_seq' column. The desired output looks like:

pep_seq    
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW

What can I do?

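The following writes one CSV file per country, containing only the field you need.

I think you could add another step to group by the second field you need as well.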

import csv

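# use a dict so you can store the list of pep_seqs found for each country
# the country value will be the dict key
csv_rows_by_country = {}
with open('in.csv', 'rb') as csv_in:  # Python 2 mode; in Python 3 use open('in.csv', newline='')
    csv_reader = csv.reader(csv_in)
    next(csv_reader)  # skip the header row so 'country' is not treated as a group
    for row in csv_reader: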
        if row[7] in csv_rows_by_country:
            # add this pep_seq to the list we already found for this country
            csv_rows_by_country[row[7]].append(row[4])
        else:
            # start a new list for this country - we haven't seen it before
            csv_rows_by_country[row[7]] = [row[4],]

for country in csv_rows_by_country:
    # create a csv output file for each country and write the pep_seqs into it.
    with open('out_%s.csv' % (country, ), 'wb') as csv_out:
        csv_writer = csv.writer(csv_out)
        for pep_seq in csv_rows_by_country[country]:
            csv_writer.writerow([pep_seq, ])
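
The extra grouping step mentioned above could be sketched by keying the dict on a tuple of both fields instead of the country alone. This is only a sketch under some assumptions: it targets Python 3 (text-mode files opened with newline=''), and the function name split_pep_seq, its parameters, and the output naming scheme are hypothetical, not part of the original answer.

```python
import csv
from collections import defaultdict


def split_pep_seq(in_path, out_prefix, group_fields=("ident", "country")):
    """Write one CSV per distinct combination of group_fields, holding only pep_seq.

    Hypothetical helper: the name, parameters and file naming are illustrative.
    """
    groups = defaultdict(list)
    # In Python 3 the csv module expects text-mode files opened with newline=""
    with open(in_path, newline="") as csv_in:
        # DictReader consumes the header row and uses it as the field names
        for row in csv.DictReader(csv_in):
            key = tuple(row[field] for field in group_fields)
            groups[key].append(row["pep_seq"])

    written = []
    for key, pep_seqs in groups.items():
        # e.g. out_prefix + "F40_EB.csv" for ident=F40, country=EB
        out_path = "{}{}.csv".format(out_prefix, "_".join(key))
        with open(out_path, "w", newline="") as csv_out:
            writer = csv.writer(csv_out)
            writer.writerow(["pep_seq"])
            writer.writerows([seq] for seq in pep_seqs)
        written.append(out_path)
    return written
```

Calling split_pep_seq('in.csv', 'out_') would then produce files such as out_F40_EB.csv for the sample data. Values used in file names are taken verbatim from the CSV, so fields containing characters that are invalid in file names would need sanitising first.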

Your code is almost correct. You just need to set fieldnames properly and pass extrasaction='ignore'. This tells DictWriter to write only the fields you specify:

import itertools   
import operator    
import csv

outfile_path4 = 'input.csv'    
outfile_path5 = r'my_output_folder\output.csv'
csv_contents = []

with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)   # default delimiter is comma
    fieldnames = dict_reader.fieldnames # save for writing

    for line in dict_reader:            # read in all of your data
        csv_contents.append(line)         # gather data into a list (of dicts)

group = ['prot_desc','ident','country']
# input to itertools.groupby must be sorted by the grouping value 
sorted_csv_contents = sorted(csv_contents, key=operator.itemgetter(*group))

for groupkey, groupdata in itertools.groupby(sorted_csv_contents, key=operator.itemgetter(*group)):
    with open(outfile_path5+'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=['pep_seq'], extrasaction='ignore')    
        dict_writer.writeheader()
        dict_writer.writerows(groupdata) 
This will give you an output CSV file containing:

pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
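
To see what extrasaction changes in isolation, here is a small Python 3 sketch that writes a single row into an in-memory buffer. The sample row and the helper name write_pep_seq are made up for illustration. With the default extrasaction='raise', DictWriter raises ValueError when a row has keys that are not in fieldnames; with 'ignore' those keys are dropped.

```python
import csv
import io

# Illustrative row with more keys than we want to write
row = {"pep_seq": "ANSPV", "ident": "F40", "country": "EB"}


def write_pep_seq(extrasaction):
    """Write the sample row keeping only pep_seq; hypothetical helper."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["pep_seq"], extrasaction=extrasaction)
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()


# Extra keys ('ident', 'country') are silently dropped
ignored = write_pep_seq("ignore")

# The default behaviour rejects rows with keys missing from fieldnames
try:
    write_pep_seq("raise")
    raised = False
except ValueError:
    raised = True
```

This is why the original attempt, which passed every field name from the reader, wrote all columns, and why narrowing fieldnames without extrasaction='ignore' would fail instead.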

Consider using pandas.read_csv() and DataFrame.to_csv(); there is plenty of material available on both.
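
A minimal sketch of what that could look like, assuming pandas is installed and the column names match the sample data in the question. The function name split_with_pandas and the output naming are hypothetical. Note that groupby drops rows whose grouping values are missing by default.

```python
import pandas as pd


def split_with_pandas(in_path, out_prefix, group_fields=("ident", "country")):
    """Write one CSV per group containing only the pep_seq column.

    Hypothetical helper; assumes at least two grouping fields so that
    each group key is a tuple of strings.
    """
    # dtype=str keeps every value as text so keys can be joined into a file name
    df = pd.read_csv(in_path, dtype=str)
    written = []
    for key, part in df.groupby(list(group_fields)):
        out_path = "{}{}.csv".format(out_prefix, "_".join(key))
        # Double brackets select a one-column DataFrame; index=False omits row labels
        part[["pep_seq"]].to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

For example, split_with_pandas('in.csv', 'out_') would write out_F40_EB.csv for the rows shown in the question.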