How to make sure a very large file contains only unique lines in VB.NET
Tags: vb.net, text, hashmap, collation

Language: VB.NET
File size: 1 GB or more
Text file encoding: UTF-8 (so each character is represented by a varying number of bytes)
Collation: Unicode, case-insensitive (when several characters are essentially the same, only the most common variant should be kept). I think I know how to handle this part.

Because each character takes a different number of bytes and each line holds a different number of characters, the number of bytes per line varies as well.

I think we have to compute a hash for each line. We also need to store the position in the file where each line starts. When two hashes match, we then have to compare the stored lines to check whether they really are the same line.

Is there a built-in function that is best suited to this?

Depending on the length of the lines, you could compute an MD5 hash for each line and store it in a hash set:
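Imports System.IO
Imports System.Linq
Imports System.Security.Cryptography
Imports System.Text

' Assumes this runs inside a method; "myFile" is a placeholder path.
Using sr As New StreamReader("myFile")
    Using hasher As MD5 = MD5.Create()
        Dim lines As New HashSet(Of String)
        ' ReadLine returns Nothing at end of file. Comparing BaseStream.Position with
        ' BaseStream.Length would stop early, because StreamReader reads ahead into a buffer.
        Dim l As String = sr.ReadLine()
        While l IsNot Nothing
            ' Hex-encode the MD5 digest of the line's UTF-8 bytes.
            Dim hash As String = String.Join(String.Empty, hasher.ComputeHash(Encoding.UTF8.GetBytes(l)).Select(Function(x) x.ToString("x2")))
            ' HashSet.Add returns False when the hash is already present.
            If Not lines.Add(hash) Then
                'Lines are not unique
                Exit While
            End If
            l = sr.ReadLine()
        End While
    End Using
End Using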
Untested, but this may be good enough for your needs. I can't think of anything faster that stays this simple :)
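Note that hashing the raw UTF-8 bytes treats lines that differ only in case (or in how an accented character is encoded) as different lines, while the question asks for a case-insensitive Unicode collation. Below is a minimal sketch of one way to get closer. It assumes that compatibility normalization followed by invariant upper-casing is an acceptable stand-in for that collation, which may not hold for every language. `CollationKeyDemo` and `CollationHash` are illustrative names, not part of the answer above.

```vbnet
Imports System
Imports System.Linq
Imports System.Security.Cryptography
Imports System.Text

Module CollationKeyDemo
    ' Hypothetical helper: folds a line to a canonical form, then hashes it.
    Function CollationHash(ByVal line As String, ByVal hasher As MD5) As String
        ' FormKC merges equivalent character sequences; ToUpperInvariant folds case.
        Dim folded As String = line.Normalize(NormalizationForm.FormKC).ToUpperInvariant()
        Dim digest As Byte() = hasher.ComputeHash(Encoding.UTF8.GetBytes(folded))
        Return String.Concat(digest.Select(Function(x) x.ToString("x2")))
    End Function

    Sub Main()
        Using hasher As MD5 = MD5.Create()
            ' "é" as a single code point versus "E" followed by a combining acute accent.
            Dim a As String = CollationHash("Caf" & ChrW(&HE9), hasher)
            Dim b As String = CollationHash("CAFE" & ChrW(&H301), hasher)
            Console.WriteLine(a = b) ' prints True
        End Using
    End Sub
End Module
```

The hash set from the answer would then store `CollationHash(l, hasher)` instead of the hash of the unmodified line.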
This is the current answer:
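Public Sub makeUniqueForLargeFiles(ByVal strFileSource As String)
    ' reserveFileName and defaultEncoding are defined elsewhere in the author's project.
    Dim changeFileName = reserveFileName(strFileSource, False, True)
    ' Maps a line's hash code to the byte offsets of the kept lines that share it.
    Dim lines As New System.Collections.Generic.Dictionary(Of Integer, System.Collections.Generic.List(Of Long))
    ' Raw streams are used instead of StreamReader: StreamReader reads ahead into a buffer,
    ' so its BaseStream.Position does not point at the start of the next line.
    Using src As New System.IO.FileStream(strFileSource, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read)
        Using probe As New System.IO.FileStream(strFileSource, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read)
            Using sw As New System.IO.StreamWriter(changeFileName, False, defaultEncoding)
                Do
                    Dim offset = src.Position
                    Dim l As String = readLineAt(src)
                    If l Is Nothing Then Exit Do
                    Dim hash = l.GetHashCode()
                    Dim offsets As System.Collections.Generic.List(Of Long) = Nothing
                    Dim isDuplicate = False
                    If lines.TryGetValue(hash, offsets) Then
                        ' Same hash: re-read each candidate line to rule out a collision.
                        For Each offset1 In offsets
                            probe.Position = offset1
                            If l = readLineAt(probe) Then
                                isDuplicate = True
                                Exit For
                            End If
                        Next
                    Else
                        offsets = New System.Collections.Generic.List(Of Long)
                        lines.Add(hash, offsets)
                    End If
                    If Not isDuplicate Then
                        ' First occurrence: remember where it starts and copy it to the output.
                        offsets.Add(offset)
                        sw.WriteLine(l)
                    End If
                Loop
            End Using
        End Using
    End Using
End Sub

' Reads one UTF-8 line (terminated by LF or CRLF) from the stream's current position.
' Returns Nothing at end of file. The LF byte (10) never occurs inside a multi-byte
' UTF-8 sequence, so scanning for it byte by byte is safe.
' A byte-order mark at the start of the file, if any, stays attached to the first line.
Private Function readLineAt(ByVal s As System.IO.Stream) As String
    Dim b = s.ReadByte()
    If b < 0 Then Return Nothing
    Dim bytes As New System.Collections.Generic.List(Of Byte)
    While b >= 0 AndAlso b <> 10
        bytes.Add(CByte(b))
        b = s.ReadByte()
    End While
    Return System.Text.Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd(ControlChars.Cr)
End Function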
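The `GetHashCode` call and the `=` comparison above are both case-sensitive under VB's default `Option Compare Binary`. If lines that differ only in case should count as duplicates, one option is to take both the hash and the equality test from the same `StringComparer`, so that lines considered equal always land in the same dictionary bucket. This is a small sketch, assuming the invariant culture is close enough to the collation the question has in mind. `ComparerDemo` is an illustrative name.

```vbnet
Imports System

Module ComparerDemo
    Sub Main()
        ' One comparer supplies both the hash code and the equality test.
        Dim comparer As StringComparer = StringComparer.InvariantCultureIgnoreCase
        Dim l As String = "Hello World"
        Dim l2 As String = "HELLO world"

        ' Plain comparison is case-sensitive.
        Console.WriteLine(l = l2) ' prints False
        ' The comparer treats the two lines as equal and hashes them identically.
        Console.WriteLine(comparer.Equals(l, l2)) ' prints True
        Console.WriteLine(comparer.GetHashCode(l) = comparer.GetHashCode(l2)) ' prints True
    End Sub
End Module
```

In the routine above this would mean using `comparer.GetHashCode(l)` for the dictionary key and `comparer.Equals(l, l2)` for the collision check.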