Python: how to combine multiple CSV files while checking for duplicate row labels and adding the new data alongside
I have searched thoroughly for an answer to my question, but with no luck. To explain the problem better: my task is to merge several .csv files while doing a few other things along the way. For example, suppose I have three files named run_1.csv, run_2.csv and run_3.csv, all located in a directory called /runs/.
run_1.csv looks like:
Name, Mass (kg), run_1
One, 1, 5.4
Two, 2, 4.5
Three, 3, 6.5
run_2.csv looks like:
Name, Mass (kg), run_2
One, 1, 5.7
Two, 2, 6.7
and run_3.csv looks like:
Name, Mass (kg), run_3
One, 1, 4.7
Three, 3, 5.9
Four, 4, 2.0
I want my output file (output.csv) to look like this (note that the order of the rows doesn't matter):
name,mass,run_1,run_2,run_3
One,1, 5.4, 5.7, 4.7
Three,3, 6.5, , 5.9
Two,2, 4.5, 6.7,
Four,4, , , 2.0
Currently I'm working with the csv module and have done the following:
import os
import csv

fields = ['name', 'mass', 'run_1', 'run_2', 'run_3']
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    writer.writerow(fields)  # write the header
    file_names = []
    for file in os.listdir('/runs/'):
        file_names.append(file)
        # NOTE: file_name and images_full are not defined anywhere in the code as posted
        with open('/runs/' + file_name + '.csv', newline='') as infile:
            reader = csv.reader(infile)
            next(reader)  # just skipping the first row, the header
            entries = set()
            for row in reader:
                line = []
                key = row[0]
                time = row[2]
                if key not in entries:
                    row.remove(row[-1])
                    line.extend(row)
                    for number in images_full:
                        line.append('')
                    line.insert(fields.index(file_name.strip('.csv')), time)
                    writer.writerow(line)
                elif key in entries:
                    row.remove(row[-1])
                    line.extend(row)
                    for number in images_full:
                        line.append('')
                    line.insert(fields.index(file_name.strip('.csv')), time)
                    writer.writerow(line)  # BUT, I only want it to add this data into the missing spot, not overwrite the whole line!
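So I'm at a loss, and any help would be much appreciated. The input csv files could be changed, but I believe there is a way to make this work with them as they are.
Edit: I solved this by reading the original csv files into a dictionary and then writing it out, see below: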
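# result maps each name to a dict of column values; result, fields and
# total_data_file_name are defined earlier (not shown in the post)
with open('/result' + total_data_file_name, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fields)
    fields.pop(0)  # drop the name column; the rest are looked up per name
    for name in result:
        line = [name]
        for field_key in fields:
            try:
                line.append(result[name][field_key])
            except KeyError:
                # no value for this column: leave the cell blank
                line.append('')
        writer.writerow(line)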
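The snippet above shows only the writing half, and `result` is not defined in it. A minimal Python 3 sketch of the reading half, assuming `result` maps each name to a dict of column values (the function name `read_runs` and that layout are assumptions, not taken from the post), might look like this:

```python
import csv

def read_runs(paths):
    """Collect rows from several run files into a dict of dicts.

    Returns {name: {'mass': ..., 'run_1': ..., ...}}. This layout is an
    assumption about what `result` looked like in the post.
    """
    result = {}
    for path in paths:
        with open(path, newline='') as infile:
            # skipinitialspace drops the blank that follows each comma in the sample files
            reader = csv.reader(infile, skipinitialspace=True)
            header = next(reader)
            run_column = header[2]  # e.g. 'run_1'
            for row in reader:
                if not row:
                    continue  # ignore empty lines
                name, mass, value = row
                entry = result.setdefault(name, {})
                entry['mass'] = mass
                entry[run_column] = value
    return result
```

With the three sample files, a name that is absent from a run simply has no key for that run, which is what lets the writer fall back to an empty cell on `KeyError`.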
This puts all the values below the header into a dict. Duplicates are dropped, so you can just write the header and then the keys/values.
new_data_dict = {}
files = ["in.csv", "in2.csv", "in3.csv"]
for path in files:
    with open(path) as f:
        next(f)  # skip the header line
        for row in f:
            row = row.strip().split(",")
            # one set of values per name; repeated values collapse
            new_data_dict.setdefault(row[0], set())
            new_data_dict[row[0]].update(row[1:])
print(new_data_dict)
Output (the order of the elements inside each set may vary):
{'One': {' 1', ' 5.4', ' 5.7', ' 4.7'}, 'Two': {' 2', ' 4.5', ' 6.7'}, 'Three': {' 3', ' 6.5', ' 5.9'}, 'Four': {' 4', ' 2.0'}}
To write the data out:
import csv

new_data_dict = {}
files = ["in.csv", "in2.csv", "in3.csv"]
headers = set()
for path in files:
    with open(path) as f:
        # collect the run column name from each file's header line
        headers.update(next(f).rstrip().split(",")[2:])
        for row in f:
            row = row.strip().split(",")
            new_data_dict.setdefault(row[0], set())
            new_data_dict[row[0]].update(row[1:])

# order the run columns by their numeric suffix
headers = ["Name", "Mass (kg)"] + sorted(headers, key=lambda x: int(x.split("_")[-1]))
with open("out.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(headers)
    for k, v in new_data_dict.items():
        writer.writerow([k] + list(v))
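To preserve the order of the values:
# start from empty containers; this version stores a list per name instead of a set
new_data_dict = {}
headers = set()
for path in files:
    with open(path) as f:
        headers.update(next(f).rstrip().split(",")[2:])
        for row in f:
            row = row.strip().split(",")
            new_data_dict.setdefault(row[0], [])
            new_data_dict[row[0]] += row[1:]

headers = ["Name", "Mass (kg)"] + sorted(headers, key=lambda x: int(x.split("_")[-1]))
with open("out.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(headers)
    for k, v in new_data_dict.items():
        # drop duplicates but keep each value at its first position in the list
        writer.writerow([k] + sorted(set(v), key=lambda x: new_data_dict[k].index(x)))
Output (row order may differ):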
Name,Mass (kg), run_1, run_2, run_3
Four, 4, 2.0
Three, 3, 6.5, 5.9
Two, 2, 4.5, 6.7
One, 1, 5.4, 5.7, 4.7
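The `key` function passed to `sorted` orders the run columns by their numeric suffix rather than alphabetically. A small sketch of why that matters once there are ten or more runs (the column names and the helper `run_number` are made up for illustration):

```python
def run_number(column):
    """Return the integer after the last underscore, e.g. ' run_10' -> 10."""
    return int(column.split("_")[-1])

columns = {" run_10", " run_2", " run_1"}

alphabetical = sorted(columns)
numeric = sorted(columns, key=run_number)

print(alphabetical)  # [' run_1', ' run_10', ' run_2']
print(numeric)       # [' run_1', ' run_2', ' run_10']
```

Plain string sorting puts run_10 before run_2, so the numeric key is what keeps the header in run order.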
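I think a dictionary is the tool you need for storing the key/value pairs. Also, you need to parse all of the files before writing anything out.
Edit: If you need a blank wherever a run has no entry for a given name, you can use a dictionary of dictionaries.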
import os
import csv

fields = ['name', 'mass', 'run_1', 'run_2', 'run_3']

# Use a dictionary to store the results of all runs.
# Each key in this dictionary is a string built from the name, like 'One', 'Two', 'Three', etc.
# The values are themselves dictionaries, keyed by the run index.
runs = dict()

# parse all the files first
for file in os.listdir('runs/'):
    with open('runs/' + file, newline='') as infile:
        reader = csv.reader(infile)
        next(reader)  # just skipping the first row, the header
        # Get the run index for the sub-key, e.g. 'run_3.csv' -> 3
        run_index = int(os.path.splitext(file)[0].split('_')[-1])
        for row in reader:
            key = row[0]
            index = row[1]
            time = row[2]
            # make the key a string like "Four 4"
            key = key + ' ' + index  # use whitespace delimiter
            # create the sub-dictionary on first sight, then record this run's time
            runs.setdefault(key, dict())[run_index] = time

# find the entry with the max number of elements in its sub-dictionary
max_entries = 0
key_w_max_entries = -1
for key in runs:
    if len(runs[key]) > max_entries:
        max_entries = len(runs[key])
        key_w_max_entries = key

# now write out the dictionary values
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    writer.writerow(fields)  # write the header
    for key in runs:
        line = key.split()  # split on whitespace
        for i in sorted(runs[key_w_max_entries]):
            try:
                line.append(str(runs[key][i]))
            except KeyError:
                # if the key doesn't exist in the sub-dictionary, fill in a blank
                line.append(' ')
        writer.writerow(line)
This gives me a file like this:
name,mass,run_1,run_2,run_3
One,1, 5.4, 5.7, 4.7
Three,3, 6.5, , 5.9
Two,2, 4.5, 6.7,
Four,4, , , 2.0
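One detail to watch when extracting the run index from a file name: `str.rstrip` and `str.lstrip` remove any of the given characters, not a literal suffix or prefix. That happens to work for names like run_1.csv, but a sketch based on `os.path.splitext` avoids relying on it (the helper name `run_index_from_filename` is hypothetical):

```python
import os

def run_index_from_filename(filename):
    """Extract the trailing integer from names like 'run_3.csv'."""
    stem, _ext = os.path.splitext(filename)  # 'run_3.csv' -> ('run_3', '.csv')
    return int(stem.rsplit('_', 1)[-1])

# rstrip removes any trailing '.', 'c', 's' or 'v' characters, not the suffix '.csv'
print('stats.csv'.rstrip('.csv'))             # stat
print(os.path.splitext('stats.csv')[0])       # stats
print(run_index_from_filename('run_12.csv'))  # 12
```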
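It looks like dictionaries will be the key to this, and I'll definitely work on writing code that uses them. However, your code doesn't preserve the columns. For example, the mass of the name "Four" is not 2.0 but 4, which ends up in the next column.
I can see the error of my ways... My output has "One, 1, 5.4, 5.7, 4.7", the same as yours. Which one is it?
I like the simplicity of what you did, but as with Padraic's suggestion, the output doesn't include blanks in the columns that have no information. For example, the name Four has no information for run_1, but your output file shows run_1 as 2.0 seconds. I may see a solution based on what you wrote last, though; give me a few hours and I'll get back to you.
Sorry, I missed that you wanted to preserve the blanks in the output file. Do the blanks need to correspond to particular runs? Or does it only matter that the number of blanks/entries equals the number of runs? i.e. for Four, can I just append ", ," at the end?
Thanks for the reply. Yes, the blanks need to correspond to the particular run. I'm an astrophysicist and am using this code to compile data on exoplanet detections. Each exoplanet has information from some of the images (each corresponding to a single csv), while in the other csv files it may be off the frame, and so should be represented by a blank.
See my latest edit. You may want to populate "fields" dynamically to support a different number of runs.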
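The last comment suggests building the field list dynamically. One hedged way to do that, assuming every input file has the three-column layout shown in the question (the function name `merge_runs` and its parameters are illustrative, not from the thread), is to collect the run column names while parsing and let `csv.DictWriter` fill the gaps through its `restval` argument:

```python
import csv
import glob
import os

def merge_runs(input_dir, output_path):
    """Merge every *.csv in input_dir into one table with a column per run.

    Assumes each file has the columns name, mass, run_N; write the output
    outside input_dir so it is not picked up as an input on a later call.
    """
    rows = {}         # name -> {'name': ..., 'mass': ..., 'run_N': ...}
    run_columns = []  # run column names, in the order they are discovered
    for path in sorted(glob.glob(os.path.join(input_dir, '*.csv'))):
        with open(path, newline='') as infile:
            reader = csv.reader(infile, skipinitialspace=True)
            run_column = next(reader)[2]  # third header cell, e.g. 'run_1'
            if run_column not in run_columns:
                run_columns.append(run_column)
            for row in reader:
                if not row:
                    continue  # ignore empty lines
                name, mass, value = row
                entry = rows.setdefault(name, {'name': name, 'mass': mass})
                entry[run_column] = value
    # order run columns by numeric suffix so run_10 follows run_9
    run_columns.sort(key=lambda c: int(c.rsplit('_', 1)[-1]))
    fields = ['name', 'mass'] + run_columns
    with open(output_path, 'w', newline='') as outfile:
        # restval is written for any field missing from a row's dict
        writer = csv.DictWriter(outfile, fieldnames=fields, restval='')
        writer.writeheader()
        writer.writerows(rows.values())
    return fields
```

Because each value is stored under its own run column, a name that is missing from a run gets an empty cell in exactly that column, which is the behaviour asked for in the comments.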