Python: Removing duplicates from a large text file


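I need my code to remove duplicate lines from a file. At the moment it just writes out a copy of the input file. Does anyone know how to fix this? The for loop isn't behaving the way I expect.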

#!usr/bin/python
import os
import sys

#Reading Input file
f = open(sys.argv[1]).readlines()

#printing no of lines in the input file
print "Total lines in the input file",len(f)

#temporary dictionary to store the unique records/rows
temp = {}

#counter to count unique items
count = 0

for i in range(0,9057,1):
    if i not in temp: #if row is not there in dictionary i.e it is unique so store it into a dictionary
        temp[f[i]] = 1;
        count += 1
    else:   #if exact row is there then print duplicate record and dont store that
        print "Duplicate Records",f[i]
        continue;

#once all the records are read print how many unique records are there
#u can print all unique records by printing temp
print "Unique records",count,len(temp)

#f = open("C://Python27//Vendor Heat Map Test 31072015.csv", 'w')
#print f
#f.close()
nf = open("C://Python34//Unique_Data.csv", "w")
for data in temp.keys():
        nf.write(data)
nf.close()


# Written by Gary O'Neill
# Date 03-08-15

You should test whether f[i] is in temp, not i. Change the line:

 if i not in temp:

to:

 if f[i] not in temp:
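To illustrate why that change matters: the keys of temp are line strings, while i is an integer index, so the test on i is always true and the else branch never reports a duplicate. The following is a minimal sketch of the corrected loop, written for Python 3. The function name unique_lines and the sample data are made up for illustration, and it iterates over the lines directly instead of using the hard-coded range from the question. Returning the keys in first-seen order relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on.

```python
def unique_lines(lines):
    """Return the distinct lines in first-seen order, plus the duplicates found."""
    seen = {}        # plays the role of `temp` in the question
    duplicates = []
    for line in lines:          # iterate over the lines, not over a fixed index range
        if line not in seen:    # test the line itself, not its index
            seen[line] = 1
        else:
            duplicates.append(line)
    return list(seen), duplicates


# Hypothetical sample data standing in for the contents of the input file.
sample = ["a,1\n", "b,2\n", "a,1\n", "c,3\n", "b,2\n"]

# An integer index is never equal to a string key, so this membership test is False.
print(0 in {"a,1\n": 1})

unique, dupes = unique_lines(sample)
print(len(unique), "unique records;", len(dupes), "duplicates")
```

With a real file, the list returned by readlines() could be passed in place of sample.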


Here is a better way to do what you want:

infile_path = 'infile.csv'
outfile_path = 'outfile.csv'

written_lines = set()

with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
    for line in infile:
        if line not in written_lines:
            outfile.write(line)
            written_lines.add(line)
        else:
            # print() with a single argument works on both Python 2 and Python 3
            print("Duplicate record: {}".format(line))

print("{} unique records".format(len(written_lines)))

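This reads one line at a time, so the input file itself never has to fit in memory. Admittedly, if most of the lines are unique, written_lines will still grow large, but that is better than holding nearly two copies of every line in memory.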
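If the written_lines set itself becomes the memory bottleneck, one possible variation (not part of the answer above, just a sketch under stated assumptions) is to store a fixed-size digest of each line instead of the line itself. The function name dedupe_file is hypothetical. The sketch assumes Python 3.6 or later for hashlib.blake2b, and it only saves memory when lines are noticeably longer than the 16-byte digest. In principle two different lines could share a digest and one would be dropped by mistake, though with a 128-bit digest that is extremely unlikely.

```python
import hashlib


def dedupe_file(infile_path, outfile_path):
    """Copy infile to outfile, skipping lines that were already written.

    Only a 16-byte digest per unique line is kept in memory.
    Returns the number of unique lines written.
    """
    seen_digests = set()
    # Binary mode keeps the bytes untouched and avoids any decoding step.
    with open(infile_path, "rb") as infile, open(outfile_path, "wb") as outfile:
        for line in infile:  # reads one line at a time
            digest = hashlib.blake2b(line, digest_size=16).digest()
            if digest not in seen_digests:
                outfile.write(line)
                seen_digests.add(digest)
    return len(seen_digests)
```

Calling dedupe_file('infile.csv', 'outfile.csv') would mirror the paths used in the answer and return the number of unique records.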


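"if i not in temp:" looks up i as the key. I think you want "if f[i] not in temp:", right?

Hi, no, that still gives me the same result.

Gary, did you try Cyphase's answer? I think both my comment and his answer can help you.

Hi Jean Jung, I was wrong. Your suggestion does in fact work very well. Thanks.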
