Python: How do I merge two CSV files based on a field and keep the same number of attributes on each record?

I am trying to merge two CSV files based on a specific field in each file.

file1.csv

id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"

file2.csv

id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False

Here is the code I am using:

import csv
from collections import OrderedDict

with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader,None) # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}

with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader,None) # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)

result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)

with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)

I get output like this. The merge itself is correct, but not all rows have the same number of attributes:

1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
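
file2 will not have a record for every id in file1. I would like the merged file to contain empty fields where file2 has no data. For example, id 1 would look like this: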

1,True,7,Purple,,,

How can I add empty fields to the records that have no data in file2, so that all records in the merged CSV have the same number of attributes?

You can use pandas for this:

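import pandas

csv1 = pandas.read_csv('file1.csv')
csv2 = pandas.read_csv('file2.csv')
# merge() defaults to an inner join, which would drop ids missing from file2;
# how='outer' keeps every id and leaves the missing attributes empty.
merged = csv1.merge(csv2, on='id', how='outer')
merged.to_csv("output.csv", index=False)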

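I haven't tested this yet, but it should put you on the right track until I can try it. The code is fairly self-explanatory: first import the pandas library so that you can use it, then read the two CSV files with pandas.read_csv and combine them with the merge method. The on parameter specifies which column should be used as the "key". Finally, the merged data is written to output.csv.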
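
To make the role of the join type concrete, here is a small sketch that uses in-memory CSV text instead of files. The column values are shortened stand-ins for the question's data, and the variable names are illustrative only. It compares the default inner join with how="outer", which keeps ids that appear in only one input.

```python
import io

import pandas as pd

# Shortened stand-ins for file1.csv and file2.csv (assumed sample data).
left = pd.read_csv(io.StringIO("id,attr1\n1,True\n2,False\n3,False\n"))
right = pd.read_csv(io.StringIO("id,attr4\n2,python\n3,other\n"))

# Default join type is "inner": only ids present in both inputs survive.
inner = left.merge(right, on="id")

# An outer join keeps every id; attributes missing on one side become NaN.
outer = left.merge(right, on="id", how="outer")

print(inner)
print(outer)
```

When the outer result is written with to_csv, missing values are emitted as empty fields by default, which is the layout the question asks for.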

If we're not using pandas, I would refactor it to something like

import csv
from collections import OrderedDict

filenames = "file1.csv", "file2.csv"
data = OrderedDict()
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp: # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            data.setdefault(row["id"], {}).update(row)

fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        writer.writerow([row.get(field, '') for field in fieldnames])

which gives

id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
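
The csv-based code above is written for Python 2 (binary file modes, itervalues). As a rough Python 3 sketch of the same idea, assuming Python 3.7+ so that plain dicts keep insertion order, it could look like the following. The function name merge_csv_files and its parameters are made up for illustration.

```python
import csv


def merge_csv_files(filenames, out_path, key="id"):
    """Merge CSV files on `key`, padding missing fields with empty strings."""
    data = {}        # key value -> merged row dict, in first-seen order
    fieldnames = []  # union of all column names, in first-seen order
    for filename in filenames:
        # newline="" is what the csv module documentation recommends
        with open(filename, newline="") as fp:
            reader = csv.DictReader(fp)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                data.setdefault(row[key], {}).update(row)

    with open(out_path, "w", newline="") as fp:
        # restval is written for any column a row does not have
        writer = csv.DictWriter(fp, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(data.values())
```

Calling merge_csv_files(["file1.csv", "file2.csv"], "merged.csv") should write a header row followed by one seven-field row per id.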

For comparison, the pandas equivalent would be something like

import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)

In my opinion this is much simpler, and it means you can spend more time working with your data and less time reinventing the wheel.

Use a dict of dicts, then update it. Like this:

import csv
from collections import OrderedDict

with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    lines2 = list(reader)

with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    lines1 = list(reader)

dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}

#merge
updatedDict = OrderedDict()
mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "?")
for id, attrs in dict1.iteritems():
    d = mergedAttrs.copy()
    d.update(attrs)
    updatedDict[id] = d

for id, attrs in dict2.iteritems():
    updatedDict[id].update(attrs)

#out
with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for id, rest in sorted(updatedDict.iteritems()):
        w.writerow([id] + rest.values())
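
For comparison with the answers above, a smaller change to the code in the question is to pad each row with as many empty strings as file2 has attribute columns. The following is only a sketch for Python 3. The helper names are invented for illustration, and it assumes, as the question states, that every id in file2 also appears in file1; ids found only in file2 would be dropped.

```python
import csv


def read_keyed_rows(path):
    """Return (header, {key: remaining fields}) for a CSV keyed on its first column."""
    with open(path, newline="") as fp:
        reader = csv.reader(fp)
        header = next(reader, None) or []
        return header, {row[0]: row[1:] for row in reader}


def merge_with_padding(path1, path2, out_path):
    fields1, rows1 = read_keyed_rows(path1)
    fields2, rows2 = read_keyed_rows(path2)
    # One empty string per attribute column of the second file
    blank = [""] * (len(fields2) - 1)
    with open(out_path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(fields1 + fields2[1:])
        # Assumption: ids that exist only in the second file are ignored
        for key, values in rows1.items():
            writer.writerow([key] + values + rows2.get(key, blank))
```

This keeps the row order of file1 and writes the combined header as the first line.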

Do you also want the header row to be id,attr1,attr2,attr3,attr4,attr5,attr6?
@s16h Yes. I didn't include that in my example code, but I already have it working.