Python - Remove duplicate entries from a csv file
I have a large poems.csv file that contains entries like the following:
"
this is a good poem.
",1
"
this is a bad poem.
",0
"
this is a good poem.
",1
"
this is a bad poem.
",0
I want to remove the duplicates from it.
If the file had no binary classifier, I could remove duplicate lines like this:
with open(data_in,'r') as in_file, open(data_out,'w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        seen.add(line)
        out_file.write(line)
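But this would also remove all the classifications. How can I remove the duplicate entries while keeping the 0s and 1s?
Expected output: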
"
this is a good poem.
",1
"
this is a bad poem.
",0
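Solved it with pandas:
import pandas as pd
# header=None assumes the file has no header row, as in the sample above;
# by default read_csv would treat the first entry as column names.
raw_data = pd.read_csv(data_in, header=None)
clean_data = raw_data.drop_duplicates()
# index=False and header=False keep the row index and column labels out of the output
clean_data.to_csv(data_out, index=False, header=False)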
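You can easily add both parts of the line to a set. Assuming your "line" consists of a string and an integer (or two strings), a tuple of the two elements is a valid set element. A tuple is immutable, hence hashable, and can be added to a set.
Splitting the lines is much easier with the csv.reader class, since it lets you read a multi-line poem as a single row, among other things.
import csv
with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        row = tuple(row) # lists are not hashable, tuples are
        if row in seen: continue # skip duplicate
        seen.add(row)
        writer.writerow(row)
For a test input in which the same haiku appears with both classifiers, this produces: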
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
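To make the tuple-in-a-set idea easy to try without touching real files, here is a minimal sketch that runs the same loop over in-memory text. The helper name `dedupe_rows` and the sample string are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import io


def dedupe_rows(in_file, out_file):
    """Copy CSV rows from in_file to out_file, skipping exact repeats."""
    seen = set()
    writer = csv.writer(out_file)
    for row in csv.reader(in_file):
        key = tuple(row)  # a list is unhashable, a tuple can go in a set
        if key in seen:
            continue  # same poem and same classifier already written
        seen.add(key)
        writer.writerow(row)


# Hypothetical sample shaped like the question's file: quoted multi-line
# poems followed by a 0/1 classifier.
sample = (
    '"\nthis is a good poem.\n",1\n'
    '"\nthis is a bad poem.\n",0\n'
    '"\nthis is a good poem.\n",1\n'
)

result = io.StringIO()
dedupe_rows(io.StringIO(sample, newline=''), result)

# Parse the output again to inspect what survived.
rows = list(csv.reader(io.StringIO(result.getvalue(), newline='')))
print(rows)
```

Because the whole row is the key, two copies of a poem with different classifiers would both be kept, which matches the output shown above.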
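Note on Python 2
The newline argument does not exist in the Python 2 version of open. This will not be a problem on most operating systems, because the line endings will be internally consistent between the input and output files. Instead of specifying newline='', the Python 2 version of csv.reader asks that files be opened in binary mode.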
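On Python 3, the effect of newline='' can be checked with a small round trip. The sketch below writes a row with an embedded line break to a temporary file and reads it back; the file name and sample rows are hypothetical. According to the csv module documentation, omitting newline='' can cause newlines inside quoted fields to be misinterpreted, and on platforms that use \r\n line endings an extra \r is added on write.

```python
import csv
import os
import tempfile

# Hypothetical rows: the first field of the first row spans two lines.
rows = [["line one\nline two", "1"], ["single line", "0"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "poems.csv")  # hypothetical file name
    # newline='' hands line-ending control to the csv module
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    with open(path, "r", newline="") as f:
        read_back = list(csv.reader(f))

print(read_back == rows)
```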
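Update
You have pointed out that the behavior of your own answer is not 100% correct. It appears that your data makes it a perfectly valid approach, so I am keeping the previous part of my answer.
To filter by poem only, ignoring the binary classifier when comparing (but keeping the one from the first occurrence), you do not need to change much in the code:
import csv
with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        if row[0] in seen: continue # skip duplicate poem, whatever its classifier
        seen.add(row[0])
        writer.writerow(row)
Since the zero classifier appears first in the file, the output for the test case above is: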
"Error 404:
Your Haiku could not be found.
Try again later.", 0
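I mentioned in the comments that you could also keep the last classifier seen, or always keep a one if one is ever found. For both of these options, I recommend using a dict (or an OrderedDict if you want to preserve the original order of the poems) keyed by the poem, with the classifier as the value. The keys of a dictionary are basically a set. You will also end up writing the output file only after the entire input file has been loaded.
To keep the last classifier seen:
import csv
from collections import OrderedDict
with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    seen = OrderedDict() # map for fast O(1) amortized lookup
    for poem, classifier in reader:
        seen[poem] = classifier # Always update to get the latest classifier
with open(data_out, 'w', newline='') as out_file:
    # Create the writer only once the output file is open
    writer = csv.writer(out_file)
    for row in seen.items():
        writer.writerow(row)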
The output of this version will have a one classifier, since that is what appears last in the test input above:
"Error 404:
Your Haiku could not be found.
Try again later.", 1
A similar approach can be used to keep the 1 classifier if one exists:
import csv
from collections import OrderedDict
with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    seen = OrderedDict() # map for fast O(1) amortized lookup
    for poem, classifier in reader:
        # strip() tolerates a space after the comma, as in the test input
        if poem not in seen or classifier.strip() == '1':
            seen[poem] = classifier
with open(data_out, 'w', newline='') as out_file:
    # Create the writer only once the output file is open
    writer = csv.writer(out_file)
    for row in seen.items():
        writer.writerow(row)
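The three strategies discussed in this answer (keep the first classifier, keep the last, or prefer a 1) can be compared side by side with a small sketch. The function name `dedupe_by_poem`, the policy names, and the sample data are all assumptions made for illustration. It uses a plain dict, which preserves insertion order on Python 3.7 and later, so OrderedDict is only needed on older versions.

```python
import csv
import io

POLICIES = ("first", "last", "prefer_one")  # hypothetical policy names


def dedupe_by_poem(rows, policy="first"):
    """Return one (poem, classifier) pair per poem, in first-seen order."""
    if policy not in POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    kept = {}  # plain dict keeps insertion order on Python 3.7+
    for poem, classifier in rows:
        classifier = classifier.strip()  # tolerate a space after the comma
        if poem not in kept:
            kept[poem] = classifier  # first occurrence always goes in
        elif policy == "last":
            kept[poem] = classifier  # overwrite with the newest value
        elif policy == "prefer_one" and classifier == "1":
            kept[poem] = classifier  # upgrade to 1 once it is seen
    return list(kept.items())


# Hypothetical sample: each poem appears with both classifiers.
sample = (
    '"a haiku",0\n'
    '"another",1\n'
    '"a haiku",1\n'
    '"a haiku",0\n'
    '"another",0\n'
)
rows = list(csv.reader(io.StringIO(sample, newline='')))

first = dedupe_by_poem(rows, "first")
last = dedupe_by_poem(rows, "last")
prefer_one = dedupe_by_poem(rows, "prefer_one")
print(first)
print(last)
print(prefer_one)
```

Updating the value of an existing key does not move it, so all three policies return the poems in the order they first appeared in the input.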
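Comments:
Also, I would like to clarify: are two identical poems with different binary classifiers considered different? Your own answer seems to support this, but I want to confirm.
No, identical entries always have the same classifier.
Can you give an example of the expected input and output? When you show only one line and no expected output, it is hard to know exactly what you want to discard.
I have provided a sample input that you can use if you like.
Thanks, your example is perfect. My data is structured the same way.
I wouldn't mention that at all. I think you will be fine without newline.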