Python: How do I split a CSV file based on a condition?
I have this CSV file:
89,Network activity,ip dst,80.179.42.44,,120160929
89,Payload delivery,md5,4ad2924ced722ab65ff978f83a40448e,120160929
89,Network activity,domain,alkamaihd.net,120160929
90,Payload delivery,md5,197C01892223728683783654D3C632A,120160929
90,Network activity,domain,dnsrecordsolver.tk,,120160929
90,Network activity,ip dst,178.33.94.47,120160929
90,Payload delivery,filename,Airline.xls,,120160929
91,Payload delivery,md5,23A9BBF8D64AE893DB17777BEDC05,120160929
91,Payload delivery,md5,07e47f06c5ed05a062e674f8d11b01d8,120160929
91,Payload delivery,md5,bd75af219f417413a4e0fae8cd89febd,120160929
91,Payload delivery,md5,9f4023f2aefc8c4c261bfdd4bd911952,120160929
91,Network activity,domain,mailsinfo.net,120160929
91,Payload delivery,md5,1e4653631feebf507faeb9406664792f,120160929
92,Payload delivery,md5,6fa869f17b703a1282b8f386d0d87bd4,120160929
92,Payload delivery,md5,24befa319fd96dea587f82eb945f5d2a,120160929
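I need to split this CSV file into 4 CSV files, where the condition is the event number at the beginning of each row. So far I have created a set containing all the event numbers, {89,90,91,92}, and I know I need a loop inside a loop that copies each row to its dedicated CSV file.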
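# Map each known event number to the list of lines that start with it.
data = {
    '89': [],
    '90': [],
    '91': [],
    '92': []
}
with open('yourfile.csv') as infile:
    for line in infile:
        prefix = line[:2]  # the two-digit event number
        if prefix in data:  # ignore blank or unexpected lines (avoids KeyError)
            data[prefix].append(line)
for prefix in data:
    # One output file per event number, e.g. csv89.csv
    with open('csv' + prefix + '.csv', 'w') as outfile:
        outfile.writelines(data[prefix])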
However, if you are open to a solution outside Python, you can do this easily by running four commands:
grep ^89 file.csv > 89.csv
grep ^90 file.csv > 90.csv
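and likewise for the other values.
You can even create the result files dynamically: keep a mapping from each id to its associated file, and open a new file whenever a first field has not been encountered before: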
files = {}  # maps each id to its open output file
try:
    with open('file.csv') as fd:
        for line in fd:
            if 0 == len(line.strip()): continue  # skip empty lines
            try:
                id_field = line.split(',', 1)[0]  # extract first field
                if id_field not in files:  # if not encountered, open a new result file
                    files[id_field] = open(id_field + '.csv', 'w')
                files[id_field].write(line)  # write the line to the proper file
            except Exception as e:
                print('ERR', line, e)  # catch-all in case of problems...
finally:
    for f in files.values():  # close every result file that was opened
        f.close()
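It is better not to hard-code the event numbers in your code, so that it does not depend on the values in the data. I also prefer to use the csv module, which is optimized for reading and writing .csv files.
Here is one way to do it: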
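import csv

prefix = 'events'  # prefix of the output csv file names
data = {}  # maps each event number to the list of its rows
# Python 3: open csv files in text mode with newline=''
# (on Python 2, use mode 'rb' here and 'wb' below instead).
with open('conditions.csv', newline='') as conditions:
    reader = csv.reader(conditions)
    for row in reader:
        if row:  # skip blank lines, which csv.reader yields as []
            data.setdefault(row[0], []).append(row)
for event in sorted(data):
    csv_filename = '{}_{}.csv'.format(prefix, event)
    print(csv_filename)
    with open(csv_filename, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(data[event])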
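Update
The approach implemented above first reads the entire csv file into memory, and then writes all the rows associated with each event value to a separate output file, one file at a time.
A more memory-efficient approach is to have multiple output files open at the same time and to write each row to the correct destination file immediately after it is read. Doing this requires keeping track of which files are already open. The file-management code also has to make sure that all the files are closed when processing is complete.
In the code below, all of this is accomplished by defining and using a Python context manager type that centralizes the handling of all the csv output files that might be generated, depending on how many different event values there are in the input file.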
import csv
import sys

PY3 = sys.version_info.major > 2

class MultiCSVOutputFileManager(object):
    """Context manager to open and close multiple csv files and csv writers.
    """
    def __enter__(self):
        self.files = {}
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for file, csv_writer in self.files.values():
            print('closing file: {}'.format(file.name))
            file.close()
        self.files.clear()
        return None

    def get_csv_writer(self, filename):
        if filename not in self.files:  # new file?
            open_kwargs = dict(mode='w', newline='') if PY3 else dict(mode='wb')
            print('opening file: {}'.format(filename))
            file = open(filename, **open_kwargs)
            self.files[filename] = file, csv.writer(file)
        return self.files[filename][1]  # return associated csv.writer object
Here is how to use it:
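prefix = 'events'  # prefix for the name of each csv output file
# Read in text mode with newline='' on Python 3, in binary mode on Python 2.
read_kwargs = dict(mode='r', newline='') if PY3 else dict(mode='rb')
with open('conditions.csv', **read_kwargs) as conditions:
    reader = csv.reader(conditions)
    with MultiCSVOutputFileManager() as file_manager:
        for row in reader:
            if not row: continue  # skip blank lines
            csv_filename = '{}_{}.csv'.format(prefix, row[0])  # row[0] is event
            writer = file_manager.get_csv_writer(csv_filename)
            writer.writerow(row)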
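On Python 3 only, the same bookkeeping (remember which files are open, close them all at the end) can also be delegated to the standard library's contextlib.ExitStack. The following is a minimal sketch of that alternative, not the method used in the answer above; the function name split_csv_by_first_field and its parameters are hypothetical and chosen just for illustration.

```python
import csv
from contextlib import ExitStack


def split_csv_by_first_field(input_path, prefix='events'):
    """Write each row to '<prefix>_<first field>.csv'; return the file names."""
    writers = {}    # first-field value -> csv.writer for its output file
    filenames = []  # output file names, in order of first appearance
    # ExitStack closes every file registered with it, even if an error occurs.
    with ExitStack() as stack:
        infile = stack.enter_context(open(input_path, newline=''))
        for row in csv.reader(infile):
            if not row:  # csv.reader yields [] for blank lines
                continue
            key = row[0]
            if key not in writers:  # first time this value is seen
                filename = '{}_{}.csv'.format(prefix, key)
                outfile = stack.enter_context(open(filename, 'w', newline=''))
                writers[key] = csv.writer(outfile)
                filenames.append(filename)
            writers[key].writerow(row)
    return filenames
```

Calling split_csv_by_first_field('conditions.csv') would then produce one events_<number>.csv file per distinct first field. As with the class above, every distinct event value keeps one file handle open until the input is exhausted, so this assumes the number of distinct values is small.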
Take a look at this similar question:
I got it, but I am getting an error: File "C:/Users/oshamir/untitled2.py", line 34, in data[prefix].append(line) KeyError: 'uu'
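The traceback in that comment is truncated, so its exact cause cannot be confirmed here. What a KeyError raised by data[prefix].append(line) does mean is that some line starts with two characters that are not one of the four dictionary keys. A minimal Python 3 sketch of a guard follows, assuming the culprit is something like a blank line, a header row, or a byte-order mark at the start of the file; event_key, known_keys, and the sample lines are hypothetical.

```python
known_keys = {'89', '90', '91', '92'}


def event_key(line):
    """Return the event number of a line, or None if the line should be skipped."""
    # Drop a leading byte-order mark, take the text before the first comma,
    # and strip surrounding whitespace such as a trailing newline.
    first_field = line.lstrip('\ufeff').split(',', 1)[0].strip()
    return first_field if first_field in known_keys else None


sample = ['89,Network activity\n', '\n', 'id,category\n', '\ufeff90,Payload delivery\n']
print([event_key(line) for line in sample])  # ['89', None, None, '90']
```

Lines for which event_key returns None can be skipped or logged instead of being used as a dictionary key.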