VB.NET: Finding repeating sequences in a string
Tags: vb.net, string, algorithm

I want to find a repeating sequence in a string in VB.NET, for example:

Dim test As String = "EDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGB"

I want the program to detect the repeated sequence, in this case EDCRFVTGB, and count how many times it repeats. My problem is finding the repeated sequence in the first place. I have searched for several approaches without finding a solution; I tried a quicksort algorithm and duplicate-detection algorithms, but several of them do not apply to strings.

I thought of creating substrings and checking whether they occur in the string, but I do not know how to choose the substrings, because the string follows no known pattern, and it may not contain a repeated sequence at all.

Do you know where the sequence starts? Do you know how long it is?

The naive algorithm is:
for each character index i
    for each character index j after i
        compare substring(i, j-i) to substring(j, j-i)
        if equal, record as a found repeating substring
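There are some optimizations, such as noting that the second substring cannot run past the end of the string (which gives an upper bound on j), and only looking for substrings longer than the longest one found so far.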
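The pseudocode and the two optimizations above can be sketched in Java as follows. This is only an illustrative sketch: the class name `NaiveRepeat` and the method name `longestAdjacentRepeat` are made up for this example, and it reports only the longest block that is immediately followed by a copy of itself. On a string made of many copies of a unit, that longest block can itself be several copies of the unit.

```java
class NaiveRepeat {
    /**
     * Returns the longest substring that is immediately followed by a copy
     * of itself, or "" if no such substring exists.
     */
    static String longestAdjacentRepeat(String s) {
        String best = "";
        for (int i = 0; i < s.length(); i++) {
            // j = i + len is the start of the second copy. Only lengths longer
            // than the best match so far are tried, and the second copy must
            // end inside the string (the upper bound on j).
            for (int len = best.length() + 1; i + 2 * len <= s.length(); len++) {
                if (s.regionMatches(i, s, i + len, len)) {
                    best = s.substring(i, i + len);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three copies of the unit from the question.
        System.out.println(longestAdjacentRepeat("EDCRFVTGBEDCRFVTGBEDCRFVTGB")); // prints EDCRFVTGB
    }
}
```

`String.regionMatches` compares the two ranges in place, so no temporary substrings are created for the comparisons.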
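This is not especially efficient (on the order of N squared index pairs to check, each costing a substring comparison), but a related, more general problem ("edit distance") also does no better than N squared, so it is a reasonable way to go.

First check whether half of the target string is repeated twice. If not, check whether a third of the string is repeated three times. If not, check whether a quarter of the string is repeated four times. Keep going until a matching sequence is found. Skipping every divisor that does not divide the length evenly makes this perform better. The following code should do the job and fill in any gaps this description leaves open: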
Public Function DetermineSequence(ByVal strTarget As String) As String
    Dim strSequence As String = String.Empty
    Dim intLengthOfTarget As Integer = strTarget.Length
    'Check for a valid Target string (at least two characters are needed for a repeat).
    If intLengthOfTarget >= 2 Then
        'Try 1/2 of Target, 1/3 of Target, 1/4 of Target, etc. until a sequence is found.
        Dim intCursor As Integer = 2
        'Run up to and including intCursor = intLengthOfTarget so that a single
        'repeated character (e.g. "AAA") is also detected.
        Do Until strSequence.Length > 0 OrElse intCursor > intLengthOfTarget
            'Don't even test the string if its length is not a divisor (to an Integer) of the length of the target String.
            If IsDividendDivisibleByDivisor(intLengthOfTarget, intCursor) Then
                'Get the possible sequence. The \ operator is integer division,
                'so this also compiles with Option Strict On.
                Dim strPossibleSequence As String = strTarget.Substring(0, intLengthOfTarget \ intCursor)
                'See if this possible sequence actually is the repeated String.
                If IsPossibleSequenceRepeatedThroughoutTarget(strPossibleSequence, strTarget) Then
                    'The repeated sequence has been found.
                    strSequence = strPossibleSequence
                End If
            End If
            intCursor += 1
        Loop
    End If
    Return strSequence
End Function
Private Function IsDividendDivisibleByDivisor(ByVal intDividend As Integer, ByVal intDivisor As Integer) As Boolean
    'Mod yields the remainder; a remainder of zero means the division is exact.
    Return intDividend Mod intDivisor = 0
End Function
Private Function IsPossibleSequenceRepeatedThroughoutTarget(ByVal strPossibleSequence As String, ByVal strTarget As String) As Boolean
    'The caller guarantees that the length of the possible sequence divides the length of the target.
    Dim intLengthOfPossibleSequence As Integer = strPossibleSequence.Length
    Dim bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated As Boolean = False
    'Start at the second chunk; the first chunk is the possible sequence itself.
    Dim intCursor As Integer = 1
    Do Until (intCursor * intLengthOfPossibleSequence) = strTarget.Length OrElse bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated
        If strTarget.Substring(intCursor * intLengthOfPossibleSequence, intLengthOfPossibleSequence) <> strPossibleSequence Then
            bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated = True
        End If
        intCursor += 1
    Loop
    'Every chunk matched unless the indicator was set.
    Return Not bolIndicatorThatPossibleSequenceIsCertainlyNotRepeated
End Function
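The question also asks how many times the sequence repeats, which the VB.NET code above does not report. As a rough sketch, here is a variant of the same divisor idea in Java. Note that it differs from the answer above in one respect: it tries the shortest candidate first instead of starting at half the string, so it returns the smallest repeating unit. The names `RepeatingUnit`, `shortestUnit` and `repeatCount` are invented for this example.

```java
class RepeatingUnit {
    /**
     * Returns the shortest prefix whose repetition builds the whole string,
     * or "" if the string is not made of two or more copies of a prefix.
     */
    static String shortestUnit(String target) {
        int n = target.length();
        for (int unitLen = 1; unitLen <= n / 2; unitLen++) {
            if (n % unitLen != 0) {
                continue; // skip lengths that do not divide the target length
            }
            boolean repeated = true;
            // Compare every later chunk against the first chunk.
            for (int pos = unitLen; pos < n && repeated; pos += unitLen) {
                repeated = target.regionMatches(pos, target, 0, unitLen);
            }
            if (repeated) {
                return target.substring(0, unitLen);
            }
        }
        return "";
    }

    /** Returns how many copies of the shortest unit make up the string, or 0 if there is no unit. */
    static int repeatCount(String target) {
        String unit = shortestUnit(target);
        return unit.isEmpty() ? 0 : target.length() / unit.length();
    }

    public static void main(String[] args) {
        String test = "EDCRFVTGBEDCRFVTGBEDCRFVTGB"; // three copies of the unit
        System.out.println(shortestUnit(test) + " x " + repeatCount(test)); // prints EDCRFVTGB x 3
    }
}
```

Like the VB.NET version, this only recognises strings that consist entirely of copies of one unit; a repeat buried inside other text needs one of the other approaches.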
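Here is an algorithm that generates all repeated sequences incrementally, ordered by length and then by first occurrence. It is based on a simple idea: for a word to occur twice in a sentence, its starting letter must occur twice.

The Java code below carries some explanation in its comments. It reports overlapping repetitions as well, e.g. BANANA => A, N, AN, NA, ANA (the two occurrences of ANA start at indices 1 and 3 and overlap). To correct for that, an index can be dropped when its distance to the preceding index is smaller than the length of the repeated substring.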
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public List<String> getRepetitions(String string) {
    List<String> repetitions = new ArrayList<String>();
    // LinkedHashMap keeps the substrings in order of first occurrence
    Map<String, List<Integer>> rep = new LinkedHashMap<String, List<Integer>>(), repOld;
    // init rep, add start position of all single character length strings
    for (int i = 0; i < string.length(); i++) {
        String s = string.substring(i, i + 1); // startIndex inclusive, endIndex exclusive
        if (rep.containsKey(s)) {
            rep.get(s).add(i);
        } else {
            List<Integer> l = new ArrayList<Integer>();
            l.add(i);
            rep.put(s, l);
        }
    }
    // eliminate those with no repetitions and add the others to the solution
    // (removeIf avoids modifying the map while iterating over it)
    rep.entrySet().removeIf(e -> e.getValue().size() < 2);
    repetitions.addAll(rep.keySet());
    // every key in repOld has length len; extend each occurrence by one character
    for (int len = 1; rep.size() > 0; len++) {
        repOld = rep;
        rep = new LinkedHashMap<String, List<Integer>>();
        for (Map.Entry<String, List<Integer>> e : repOld.entrySet()) {
            for (Integer i : e.getValue()) { // for all start indices
                if (i + len >= string.length())
                    break; // indices are ascending, so later ones cannot be extended either
                String s = e.getKey() + string.charAt(i + len);
                if (rep.containsKey(s)) {
                    rep.get(s).add(i);
                } else {
                    List<Integer> l = new ArrayList<Integer>();
                    l.add(i);
                    rep.put(s, l);
                }
            }
        }
        // eliminate those with no repetitions and add the others to the solution
        rep.entrySet().removeIf(e -> e.getValue().size() < 2);
        repetitions.addAll(rep.keySet());
    }
    return repetitions; // ordered by length, so last = longest
}
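The correction for overlapping repetitions mentioned above (dropping an index that is too close to the preceding one) might look roughly like this. It is a sketch under assumptions: the class name `OverlapFilter` and the method name `dropOverlaps` are invented for illustration, the distance is measured against the last index that was kept, and the start indices are expected in ascending order, as the algorithm above produces them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlapFilter {
    /**
     * Keeps only those start indices that lie at least `length` characters
     * after the previously kept index, so the kept occurrences do not overlap.
     * Assumes `starts` is sorted in ascending order.
     */
    static List<Integer> dropOverlaps(List<Integer> starts, int length) {
        List<Integer> kept = new ArrayList<>();
        for (int start : starts) {
            if (kept.isEmpty() || start - kept.get(kept.size() - 1) >= length) {
                kept.add(start);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "ANA" occurs in BANANA at indices 1 and 3; the two occurrences overlap.
        System.out.println(dropOverlaps(Arrays.asList(1, 3), 3)); // prints [1]
        // "AN" occurs at indices 1 and 3 without overlapping.
        System.out.println(dropOverlaps(Arrays.asList(1, 3), 2)); // prints [1, 3]
    }
}
```

After filtering, a substring with fewer than two remaining indices no longer counts as a repetition.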
Option Strict On
Option Explicit On
Option Infer Off
Public Class Form1
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        ListView1.Items.Clear()
        ListView1.Columns.Clear()
        ListView1.Columns.Add("Sequence")
        ListView1.Columns.Add("Indexes of occurrence")
        Dim sequences As List(Of Sequence) = DetectSequences("EDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGB")
        For Each s As Sequence In sequences
            Dim item As New ListViewItem(s.Sequence)
            item.Tag = s
            item.SubItems.Add(s.IndexesToString)
            ListView1.Items.Add(item)
        Next
        ListView1.AutoResizeColumns(ColumnHeaderAutoResizeStyle.HeaderSize)
    End Sub
    Function DetectSequences(s As String, Optional minLength As Integer = 5, Optional MaxLength As Integer = 8) As List(Of Sequence)
        Dim foundPatterns As New List(Of String)
        Dim foundSequences As New List(Of Sequence)
        Dim potentialPattern As String = String.Empty, potentialMatch As String = String.Empty
        For start As Integer = 0 To s.Length - 1
            For length As Integer = 1 To s.Length - start
                potentialPattern = s.Substring(start, length)
                If potentialPattern.Length < minLength Then Continue For
                If potentialPattern.Length > MaxLength Then Continue For
                If foundPatterns.IndexOf(potentialPattern) = -1 Then
                    foundPatterns.Add(potentialPattern)
                End If
            Next
        Next
        For Each pattern As String In foundPatterns
            Dim sequence As New Sequence With {.Sequence = pattern}
            For start As Integer = 0 To s.Length - pattern.Length
                Dim length As Integer = pattern.Length
                potentialMatch = s.Substring(start, length)
                If potentialMatch = pattern Then
                    sequence.Indexes.Add(start)
                End If
            Next
            If sequence.Indexes.Count > 1 Then foundSequences.Add(sequence)
        Next
        Return foundSequences
    End Function
    Public Class Sequence
        Public Sequence As String = ""
        Public Indexes As New List(Of Integer)
        Public Function IndexesToString() As String
            Dim sb As New System.Text.StringBuilder
            For i As Integer = 0 To Indexes.Count - 1
                If i = Indexes.Count - 1 Then
                    sb.Append(Indexes(i).ToString)
                Else
                    sb.Append(Indexes(i).ToString & ", ")
                End If
            Next
            Return sb.ToString
        End Function
    End Class
    Private Sub ListView1_SelectedIndexChanged(sender As Object, e As EventArgs) Handles ListView1.SelectedIndexChanged
        If ListView1.SelectedItems.Count = 0 Then Exit Sub
        RichTextBox1.Clear()
        RichTextBox1.Text = "EDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGBEDCRFVTGB"
        Dim selectedSequence As Sequence = DirectCast(ListView1.SelectedItems(0).Tag, Sequence)
        For Each i As Integer In selectedSequence.Indexes
            RichTextBox1.SelectionStart = i
            RichTextBox1.SelectionLength = selectedSequence.Sequence.Length
            RichTextBox1.SelectionBackColor = Color.Red
        Next
    End Sub
End Class