Python - Removing duplicate entries from a csv file


python csv


I have a large poems.csv file that contains entries like the following:

"
this is a good poem. 
",1

"  
this is a bad poem.    
",0

"
this is a good poem. 
",1

"  
this is a bad poem.    
",0

I would like to remove the duplicates from it.

If the file had no binary classifier, I could remove duplicate lines like this:

with open(data_in,'r') as in_file, open(data_out,'w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        seen.add(line)
        out_file.write(line)

But this would also remove all the classifications. How can I remove the duplicate entries while keeping the 0s and 1s?

Expected output:

"
this is a good poem. 
",1

"  
this is a bad poem.    
",0

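Using pandas (imported as pd) solved it:

import pandas as pd

# header=None and header=False: the file has no header row to consume or emit
raw_data = pd.read_csv(data_in, header=None)
clean_data = raw_data.drop_duplicates()
clean_data.to_csv(data_out, index=False, header=False)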


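You can easily add both parts of the line to the set. Assuming that your "line" consists of a string and an integer (or two strings), a tuple of both elements can be a valid element of a set. A tuple is immutable, and therefore hashable, so it can be added to a set.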
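As a minimal sketch of that idea, the snippet below deduplicates a few in-memory rows by storing each row as a tuple in a set. The rows are hypothetical stand-ins for parsed lines of poems.csv, not data from the real file.

```python
# Hypothetical in-memory rows standing in for parsed lines of poems.csv.
rows = [
    ("this is a good poem.", 1),
    ("this is a bad poem.", 0),
    ("this is a good poem.", 1),  # exact duplicate of the first row
    ("this is a good poem.", 0),  # same text, different classifier
]

seen = set()
unique_rows = []
for row in rows:
    if row in seen:  # tuples compare and hash by value
        continue
    seen.add(row)
    unique_rows.append(row)

print(unique_rows)
```

Only the exact duplicate is dropped here. The last row survives because its classifier differs, which is the behavior to keep in mind when the same poem can carry both labels.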

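Splitting the line is much easier with the csv.reader class, since it lets you read a multi-line poem as a single row, among other things.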

import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        row = tuple(row) # lists are not hashable, tuples are
        if row in seen: continue # skip duplicate
        seen.add(row)
        writer.writerow(row)

The result is:

"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
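To see the multi-line handling in isolation, here is a small sketch that parses a string instead of a file. The sample text is a hypothetical input modeled on the haiku above, and io.StringIO merely stands in for an opened file.

```python
import csv
import io

# Hypothetical sample: two records whose first field spans three lines.
sample = (
    '"Error 404:\nYour Haiku could not be found.\nTry again later.",0\n'
    '"Error 404:\nYour Haiku could not be found.\nTry again later.",1\n'
)

# Naive line iteration sees six physical lines.
physical_lines = sample.splitlines()

# csv.reader keeps each quoted poem together, so it yields two records.
records = list(csv.reader(io.StringIO(sample, newline="")))

print(len(physical_lines), len(records))
for poem, classifier in records:
    print(repr(poem), classifier)
```

A line-based loop would treat each of the six physical lines as a separate entry, so deduplicating on lines would mangle the poems. The reader returns the whole poem, embedded newlines included, as one field.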

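A note on Python 2

The newline argument does not exist in the Python 2 version of open. This will not be a problem on most operating systems, since the line endings will be internally consistent between the input and output files. Instead of specifying newline='', the Python 2 version of the csv module asks that files be opened in binary mode.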

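Update

You have pointed out that the behavior of your own answer is not 100% correct. It appears that your data makes it a perfectly valid approach, so I am keeping the previous part of my answer.

To filter by poem only, ignoring (but keeping) the binary classifier of the first occurrence, you do not need to change much in your code: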

import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        if row[0] in seen: continue # skip duplicate
        seen.add(row[0])
        writer.writerow(row)

Since the zero classifier appears first in the file, the output for the test case above is:

"Error 404:
Your Haiku could not be found.
Try again later.", 0

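I mentioned in the comments that you could also keep the last-seen classifier, or always keep a one classifier if one is found. For both of these options, I would recommend using a dict (or an OrderedDict, if you want to preserve the original order of the poems) keyed by the poem, with the classifier as the value. The keys of a dictionary are basically a set. You will also end up writing the output file only after the entire input file has been loaded.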
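The following sketch illustrates why a mapping works for this. It assumes Python 3.7 or later, where a plain dict also preserves insertion order, and it uses hypothetical (poem, classifier) pairs with string classifiers, as csv.reader would yield them.

```python
# Hypothetical pairs; "poem A" appears twice with different classifiers.
pairs = [("poem A", "0"), ("poem B", "1"), ("poem A", "1")]

seen = {}
for poem, classifier in pairs:
    # Assigning to an existing key replaces the value
    # but leaves the key in its original position.
    seen[poem] = classifier

print("poem A" in seen)  # membership test on keys, just like a set
print(list(seen.items()))
```

Each poem ends up as a key exactly once, in the order it was first seen, while the value reflects whichever classifier was stored last.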

To keep the last-seen classifier:

import csv
from collections import OrderedDict

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    seen = OrderedDict() # map for fast O(1) amortized lookup
    for poem, classifier in reader:
        seen[poem] = classifier # Always update to get the latest classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file) # create the writer once out_file exists
    for row in seen.items():
        writer.writerow(row)

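seen.items() iterates over (poem, classifier) tuples, which writerow can write directly.

The output of this version will have a one classifier, since that is what appears last in the test input above:
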
"Error 404:
Your Haiku could not be found.
Try again later.", 1

A similar approach can be used to keep a 1 classifier if one exists:

import csv
from collections import OrderedDict

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    seen = OrderedDict() # map for fast O(1) amortized lookup
    for poem, classifier in reader:
        # Store the first classifier seen, and let a '1' override it
        if poem not in seen or classifier == '1':
            seen[poem] = classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file) # create the writer once out_file exists
    for row in seen.items():
        writer.writerow(row)
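
The three policies discussed in this answer (keep the first, the last, or a 1 classifier) differ only in when the stored value is updated. The sketch below puts them side by side. The function name dedupe, its keep parameter, and the sample rows are all hypothetical and introduced here only for illustration. It also assumes Python 3.7 or later, so that a plain dict preserves insertion order.

```python
def dedupe(rows, keep="first"):
    """Return one (poem, classifier) pair per poem.

    keep: "first", "last", or "one" (prefer a '1' classifier if present).
    """
    seen = {}
    for poem, classifier in rows:
        if keep == "first":
            seen.setdefault(poem, classifier)  # never overwrite
        elif keep == "last":
            seen[poem] = classifier            # always overwrite
        elif keep == "one":
            if poem not in seen or classifier == "1":
                seen[poem] = classifier        # only a '1' may overwrite
        else:
            raise ValueError("keep must be 'first', 'last', or 'one'")
    return list(seen.items())


# Hypothetical rows, with classifiers as strings as csv.reader yields them.
rows = [("a", "1"), ("b", "0"), ("a", "0"), ("b", "1")]

first = dedupe(rows, "first")
last = dedupe(rows, "last")
one = dedupe(rows, "one")
print(first, last, one, sep="\n")
```

The returned pairs can be passed to csv.writer.writerow one at a time, as in the file-based versions above.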
