C# - Removing duplicate lines from a text file

c# .net windows

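Can someone demonstrate how to check a file for duplicate lines and then remove the duplicates, either by overwriting the existing file or by creating a new file with the duplicate lines removed?

Pseudocode: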

open file reading only

List<string> list = new List<string>();

for each line in the file:
    if(!list.contains(line)):
        list.append(line)

close file
open file for writing

for each string in list:
    file.write(string);

If you are using .NET 4, you can use a combination of File.ReadLines and File.WriteAllLines:

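var previousLines = new HashSet<string>();

// HashSet<T>.Add returns false for a line that was already seen,
// so Where keeps only the first occurrence of each line.
File.WriteAllLines(destinationPath, File.ReadLines(sourcePath)
                                        .Where(line => previousLines.Add(line)));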

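This works in much the same way as LINQ's Distinct method, with one important difference: the output of Distinct is not guaranteed to be in the same order as the input sequence. Using a HashSet explicitly does provide that guarantee.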
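To make the ordering point concrete, here is a minimal sketch of the same idea as a reusable iterator. The class and method names (OrderedDedup, DistinctInOrder) are hypothetical and not part of the original answer; the only library behavior relied on is that HashSet<T>.Add returns false for an item already in the set.

```csharp
using System;
using System.Collections.Generic;

public static class OrderedDedup
{
    // Yields each line the first time it appears, in input order.
    // Hypothetical helper for illustration only.
    public static IEnumerable<string> DistinctInOrder(IEnumerable<string> lines)
    {
        // The default comparer for strings is ordinal and case-sensitive.
        var seen = new HashSet<string>();
        foreach (var line in lines)
        {
            // Add returns false when the line is already in the set.
            if (seen.Add(line))
            {
                yield return line;
            }
        }
    }
}
```

It could be combined with the calls from the answer above, for example File.WriteAllLines(destinationPath, OrderedDedup.DistinctInOrder(File.ReadLines(sourcePath))). Because the lines are yielded in the order they are read, the output order follows the input file.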

File.WriteAllLines(topath, File.ReadAllLines(frompath).Distinct().ToArray());

Edit: modified to work in .NET 3.5.

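How big a file are we talking about?

One strategy is to read the file one line at a time and load the lines into a data structure that makes it easy to check for an existing item, such as a HashSet<int>. Each line of the file can be hashed with GetHashCode() (the hash that hash-based collections use when looking up strings), and then you only need to check it against the hashes already seen. Equal strings always produce the same hash code, although two different strings can occasionally share one as well. Something like: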

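var known = new HashSet<int>();
using (var dupe_free = new StreamWriter(@"c:\path\to\dupe_free.txt"))
{
    foreach (var line in File.ReadLines(@"c:\path\to\has_dupes.txt"))
    {
        // Caution: distinct strings can share a hash code, so comparing
        // hashes alone can drop a line that is not really a duplicate.
        var hash = line.GetHashCode();
        if (!known.Contains(hash))
        {
            known.Add(hash);
            dupe_free.WriteLine(line); // WriteLine keeps each entry on its own line
        }
    }
}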
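A variant of the loop above that stores the lines themselves rather than their hash codes might look like the following. This is only a sketch: the names StreamingDedup and CopyWithoutDuplicates are made up for illustration, and it assumes the source and destination paths are different files. Keeping the full strings costs more memory than keeping int hashes, but two different lines that happen to share a hash code are no longer confused.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamingDedup
{
    // Copies sourcePath to destinationPath one line at a time, skipping
    // lines that were already written. Returns the number of lines written.
    // Hypothetical helper; assumes the two paths are different files.
    public static int CopyWithoutDuplicates(string sourcePath, string destinationPath)
    {
        var seen = new HashSet<string>();
        var written = 0;
        using (var writer = new StreamWriter(destinationPath))
        {
            foreach (var line in File.ReadLines(sourcePath))
            {
                // Add compares the actual string contents, not just the hash.
                if (seen.Add(line))
                {
                    writer.WriteLine(line);
                    written++;
                }
            }
        }
        return written;
    }
}
```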

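@Felice Pollano Not homework, unless I'm a 28-year-old student.

OK, but either way you are asking for a job to be done for you…

@LukeH Right, which is why my main answer reads and writes the lines in a hand-written loop; the HashSet is a cheap lookup, and using GetHashCode guarantees correct order and uniqueness.

HashSet does not preserve insertion order. There are cases where it appears to, but that is not guaranteed. You can read about it here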

File.WriteAllLines(@"c:\path\to\dupe_free.txt", File.ReadAllLines(@"c:\path\to\has_dupes.txt").Distinct().ToArray());

// Requires .NET 3.5
private void RemoveDuplicate(string sourceFilePath, string destinationFilePath)
{
    var readLines = File.ReadAllLines(sourceFilePath, Encoding.Default);

    File.WriteAllLines(destinationFilePath, readLines.Distinct().ToArray(), Encoding.Default);
}
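The question also asks about overwriting the existing file. With File.ReadAllLines the whole file is in memory before anything is written, so the same path can be passed as source and destination. With the lazy File.ReadLines the source is still open while lines are being written, so writing to the same path at the same time typically fails. One possible workaround, sketched here under the assumption that a sibling file named path + ".tmp" is free to use (the names InPlaceDedup and RemoveDuplicatesInPlace are hypothetical), is to write to a temporary file first and then copy it over the original.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class InPlaceDedup
{
    // Removes duplicate lines from the file at 'path', keeping first occurrences.
    // Hypothetical helper; assumes path + ".tmp" is not used by anything else.
    public static void RemoveDuplicatesInPlace(string path)
    {
        var tempPath = path + ".tmp";
        var seen = new HashSet<string>();

        // WriteAllLines enumerates the lazy sequence to the end, which
        // closes the source file before the copy below runs.
        File.WriteAllLines(tempPath, File.ReadLines(path).Where(line => seen.Add(line)));

        File.Copy(tempPath, path, true); // true = overwrite the original
        File.Delete(tempPath);
    }
}
```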