Algorithm: an efficient algorithm for string concatenation with overlap


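We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts, and those parts should not be duplicated. For example: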

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"
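Our current algorithm is very slow because it uses brute force to identify the overlapping part between two strings. Does anyone know an efficient algorithm to do this?

Say we have two strings A and B. The algorithm needs to find the longest common substring S such that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference: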

public static String concat(String s1, String s2) {
    if (s1 == null)
        return s2;
    if (s2 == null)
        return s1;
    int len = Math.min(s1.length(), s2.length());

    // Find the index for the end of overlapping part
    int index = -1;
    for (int i = len; i > 0; i--) {
        String substring = s2.substring(0, i);
        if (s1.endsWith(substring)) {
            index = i;
            break;
        }
    }
    StringBuilder sb = new StringBuilder(s1);
    if (index < 0) 
        sb.append(s2);
    else if (index <= s2.length())
        sb.append(s2.substring(index));
    return sb.toString();
}

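Why not do something like this? First, get the first character or word (the one that signals an overlap) in each of the three columns.

Then start adding the first string to a StringBuffer, one character at a time.

Each time, check whether you have reached a part that overlaps with the second or third string.

If so, start concatenating the string that also contains what is in the first string.

When that is done, or if there is no overlap, continue with the second string and then the third string.

So in the second example in the question, I would keep d and g in two variables.

Then, as I add the first string, abc comes from the first string. I then see that d is also in the second string, so I switch to adding from the second string, and def is added from string 2. Then I move on and finish with string 3.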


If you are doing this in the database, why not use a stored procedure?

If you are doing it outside the database, try Perl:

sub concat {
  my($x,$y) = @_;

  return $x if $y eq '';
  return $y if $x eq '';

  my($i) = length($x) < length($y) ?  length($x) : length($y);
  while($i > 0) {
      if( substr($x,-$i) eq substr($y,0,$i) )  {
          return $x . substr($y,$i);
      }
      $i--;
  }
  return $x . $y;
}

It is exactly the same algorithm as yours; I am just curious whether Java or Perl is faster ;-)

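You can use a DFA. For example, the string XYZ should be read by the regular expression ^((X)?Y)?Z (the pattern that toRegex below produces for "XYZ"). That regular expression matches the longest prefix that is also a suffix of the string XYZ. With such a regular expression you can either match and take the match result, or generate a DFA on which the state indicates the proper position for the "cut".

In Scala, a first implementation that uses the regex directly might look like this: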

// Note: the characters of s1 are not regex-escaped, and s1 must be non-empty.
def toRegex(s1: String) = ("^" + s1.map(_.toString).reduceLeft((a, b) => "(" + a + ")?" + b)).r
def concatWithoutMatch(s1: String, s2: String) = {
  val regex = toRegex(s1)
  val prefix = regex.findFirstIn(s2).getOrElse("")
  s1 + s2.drop(prefix.length)
}
For example:

scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd

scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef

scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn
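
The same nested-optional-group construction can be sketched in Python. This is only an illustration under stated assumptions: the names to_regex and concat_without_overlap are made up here, and escaping with re.escape is added so that metacharacters in the first string are matched literally. The pattern nests one group per character, so treat it as a sketch for short strings rather than a general solution.

```python
import re
from functools import reduce

def to_regex(s1):
    # Build ((a)?b)?c style nesting from the characters of s1 (hypothetical helper).
    # Each character is escaped so that '.', '(' and friends match literally.
    parts = [re.escape(ch) for ch in s1]
    nested = reduce(lambda acc, ch: "(?:" + acc + ")?" + ch, parts)
    return re.compile(nested)

def concat_without_overlap(s1, s2):
    # An empty s1 has no characters to build a pattern from.
    if not s1:
        return s2
    # match() anchors at the start of s2; the greedy optional groups try the
    # longest suffix of s1 first.
    m = to_regex(s1).match(s2)
    overlap = m.end() if m else 0
    return s1 + s2[overlap:]
```

Because the optional groups are greedy, the longest suffix of the first string is tried before shorter ones, which mirrors the behaviour of the Scala version.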

Alternatively, you can do this in MySQL with the following stored function:

DELIMITER //

DROP FUNCTION IF EXISTS concat_with_overlap //

CREATE FUNCTION concat_with_overlap(a VARCHAR(100), b VARCHAR(100))
  RETURNS VARCHAR(200) DETERMINISTIC
BEGIN 
  DECLARE i INT;
  DECLARE al INT;
  DECLARE bl INT;
  SET al = LENGTH(a);
  SET bl = LENGTH(b);
  IF al=0 THEN 
    RETURN b;
  END IF;
  IF bl=0 THEN 
    RETURN a;
  END IF;
  IF al < bl THEN
     SET i = al;
  ELSE
     SET i = bl;
  END IF;

  search: WHILE i > 0 DO
     IF RIGHT(a,i) = LEFT(b,i) THEN
    RETURN CONCAT(a, SUBSTR(b,i+1));
     END IF;
     SET i = i - 1;
  END WHILE search;

  RETURN CONCAT(a,b);
END//
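How about the following (please excuse any mistakes on my part)? The code is the OverlapConcat method listed further below.

This implementation does not make any string copies (partial or otherwise) until it has established what needs copying, which should help performance considerably.

Also, the match check first tests the extremes of the potentially matching area (2 single characters), which in normal English text should give a good chance of avoiding any further character comparisons for mismatches.

The two strings are only concatenated once it has established the longest match it can make, or that no match is possible at all. I have used a simple '+' here because I figure the optimisation of the rest of the algorithm already removes most of the inefficiency of the original. Give it a try and let me know whether it is good enough for your purposes.

Here is a solution in Python. It should be faster because it does not need to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.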

def concat(*args):
    result = ''
    for arg in args:
        result = _concat(result, arg)
    return result

def _concat(a, b):
    la = len(a)
    lb = len(b)
    for i in range(la):
        j = i
        k = 0
        while j < la and k < lb and a[j] == b[k]:
            j += 1
            k += 1
        if j == la:
            n = k
            break
    else:
        n = 0
    return a + b[n:]

if __name__ == '__main__':
    assert concat('a', 'b', 'c') == 'abc'
    assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
    assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
    assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
    assert concat('abcdef', '', 'defghl') == 'abcdefghl'
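I think this will be pretty quick:

You have two strings, string1 and string2. Look backwards (right to left) through string1 for the first character of string2. Once you have that position, determine whether there is an overlap. If there is not, you need to keep searching. If there is, you need to determine whether another match is still possible.

To do that, simply explore the shorter of the two strings for a recurrence of the overlapping characters. That is: if the location of the match in string1 leaves a shorter remainder of string1, repeat the initial search from the new starting point in string1. Conversely, if the unmatched portion of string2 is shorter, search it for a repeat of the overlapping characters.

Repeat as required.

Job done!

This does not require much in terms of memory allocation (all searching is done in place; only the result string buffer needs to be allocated) and requires at most one pass over one of the strings being overlapped.

I am trying to make this C# as pleasant to read as possible.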
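
A simplified variant of this search can be sketched in Python. It is an illustration, not the exact procedure described above: it scans string1 left to right (starting at the earliest position where an overlap could still fit) so that the first successful candidate is also the longest, and the function name overlap_concat is an assumption made for this sketch.

```python
def overlap_concat(s1, s2):
    # Hypothetical helper: join s1 and s2 without repeating their overlap.
    if not s1 or not s2:
        return s1 + s2
    first = s2[0]
    # Only the last len(s2) characters of s1 can take part in an overlap.
    start = max(0, len(s1) - len(s2))
    pos = s1.find(first, start)
    while pos != -1:
        # Candidate overlap: the suffix of s1 starting at pos.
        if s2.startswith(s1[pos:]):
            return s1 + s2[len(s1) - pos:]
        # Move on to the next occurrence of s2's first character.
        pos = s1.find(first, pos + 1)
    return s1 + s2
```

Only positions where string1 contains the first character of string2 are examined, which is the filtering idea from the description; the suffix check itself still copies a slice, so this trades a little allocation for brevity.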
public static string OverlapConcat(string s1, string s2)
{
    // Handle nulls... never return a null
    if (string.IsNullOrEmpty(s1))
    {
        if (string.IsNullOrEmpty(s2))
            return string.Empty;
        else
            return s2;
    }
    if (string.IsNullOrEmpty(s2))
        return s1;

    // Checks above guarantee both strings have at least one character
    int len1 = s1.Length - 1;
    char last1 = s1[len1];
    char first2 = s2[0];

    // Find the first potential match, bounded by the length of s1
    int indexOfLast2 = s2.LastIndexOf(last1, Math.Min(len1, s2.Length - 1));
    while (indexOfLast2 != -1)
    {
        if (s1[len1 - indexOfLast2] == first2)
        {
            // After the quick check, do a full check
            int ix = indexOfLast2;
            while ((ix != -1) && (s1[len1 - indexOfLast2 + ix] == s2[ix]))
                ix--;
            if (ix == -1)
                return s1 + s2.Substring(indexOfLast2 + 1);
        }

        // Search for the next possible match
        indexOfLast2 = s2.LastIndexOf(last1, indexOfLast2 - 1);
    }

    // No match found, so concatenate the full strings
    return s1 + s2;
}
    public static string Concatenate(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1)) return s2;
        if (string.IsNullOrEmpty(s2)) return s1;
        // Note: these two shortcuts treat containment as a full overlap, which
        // differs from the question's fourth example ("abcde" + "d" => "abcded").
        if (s1.Contains(s2)) return s1;
        if (s2.Contains(s1)) return s2;

        char startChar = s2[0];

        // Try each occurrence of s2's first character in s1, left to right,
        // so the longest candidate overlap is checked first.
        int s1FirstIndexOfStartChar = s1.IndexOf(startChar);

        while (s1FirstIndexOfStartChar >= 0)
        {
            int overlapLength = s1.Length - s1FirstIndexOfStartChar;
            if (CheckOverlap(s1, s2, overlapLength))
            {
                return s1 + s2.Substring(overlapLength);
            }

            // Advance past the current occurrence; searching again from the
            // same index would find it again and loop forever.
            s1FirstIndexOfStartChar =
                s1.IndexOf(startChar, s1FirstIndexOfStartChar + 1);
        }

        return s1 + s2;
    }

    private static bool CheckOverlap(string s1, string s2, int overlapLength)
    {
        // An overlap cannot be empty or longer than s2.
        if (overlapLength <= 0 || overlapLength > s2.Length)
            return false;

        return s1.Substring(s1.Length - overlapLength) ==
               s2.Substring(0, overlapLength);
    }
    int OverlappedStringLength(string s1, string s2) {
        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0;
        int i = 0;
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
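        var T = new int[s.Length];
        if (s.Length == 0) return T; // nothing to compute for an empty string
        int cnd = 0;
        T[0] = -1;
        if (s.Length > 1) T[1] = 0; // a one-character string has no T[1]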
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }
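
The KMP-based idea in the C# code above can also be sketched in Python. This is a minimal sketch rather than a port: the function names are assumptions, and it uses the common prefix-function form of the failure table instead of the -1 based backtrack table. It runs in time linear in the combined length of the two strings.

```python
def overlap_length(a, b):
    # Length of the longest suffix of a that is also a prefix of b.
    if not a or not b:
        return 0
    # Only the last len(b) characters of a can overlap with b.
    tail = a[-len(b):]
    # fail[i] = length of the longest proper prefix of b[:i+1]
    # that is also a suffix of it.
    fail = [0] * len(b)
    k = 0
    for i in range(1, len(b)):
        while k and b[i] != b[k]:
            k = fail[k - 1]
        if b[i] == b[k]:
            k += 1
        fail[i] = k
    # Scan the tail, tracking how many characters of b are currently matched.
    # Since len(tail) <= len(b), k can reach len(b) only on the last character.
    k = 0
    for ch in tail:
        while k and ch != b[k]:
            k = fail[k - 1]
        if ch == b[k]:
            k += 1
    return k

def kmp_concat(a, b):
    # Append only the part of b that does not overlap with the end of a.
    return a + b[overlap_length(a, b):]
```

After the scan, the number of matched characters is exactly the overlap length, because the match in progress at the end of the first string is the longest prefix of the second string that ends there.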
static class Candidate {
    int matchLen = 0;
}

private String overlapOnce(@NotNull final String a, @NotNull final String b) {
    final int maxOverlap = Math.min(a.length(), b.length());
    final Collection<Candidate> candidates = new LinkedList<>();
    for (int i = a.length() - maxOverlap; i < a.length(); ++i) {
        if (a.charAt(i) == b.charAt(0)) {
            candidates.add(new Candidate());
        }
        for (final Iterator<Candidate> it = candidates.iterator(); it.hasNext(); ) {
            final Candidate candidate = it.next();
            if (a.charAt(i) == b.charAt(candidate.matchLen)) {
                //advance
                ++candidate.matchLen;
            } else {
                //not matching anymore, remove
                it.remove();
            }
        }

    }
    final int matchLen = candidates.isEmpty() ? 0 :
            candidates.stream().map(c -> c.matchLen).max(Comparator.comparingInt(l -> l)).get();
    return a + b.substring(matchLen);
}

private String overlapOnce(@NotNull final String... strings) {
    return Arrays.stream(strings).reduce("", this::overlapOnce);
}
@Test
public void testOverlapOnce() throws Exception {
    assertEquals("", overlapOnce("", ""));
    assertEquals("ab", overlapOnce("a", "b"));
    assertEquals("abc", overlapOnce("ab", "bc"));
    assertEquals("abcdefghqabcdefghi", overlapOnce("abcdefgh", "efghqabcdefghi"));
    assertEquals("aaaaaabaaaaaa", overlapOnce("aaaaaab", "baaaaaa"));
    assertEquals("ccc", overlapOnce("ccc", "ccc"));
    assertEquals("abcabc", overlapOnce("abcabc", "abcabc"));

    /**
     *  "a" + "b" + "c" => "abc"
     "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
     "abcdede" + "dedefgh" + "" => "abcdedefgh"
     "abcde" + "d" + "ghlmn" => "abcdedghlmn"
     "abcdef" + "" + "defghl" => "abcdefghl"
     */
    assertEquals("abc", overlapOnce("a", "b", "c"));
    assertEquals("abcdefghlmn", overlapOnce("abcde", "defgh", "ghlmn"));
    assertEquals("abcdedefgh", overlapOnce("abcdede", "dedefgh"));
    assertEquals("abcdedghlmn", overlapOnce("abcde", "d", "ghlmn"));
    assertEquals("abcdefghl", overlapOnce("abcdef", "", "defghl"));


    // Consider str1=abXabXabXac and str2=XabXac. Your approach will output abXabXabXacXabXac because by
    // resetting j=0, it goes too far back.
    assertEquals("abXabXabXac", overlapOnce("abXabXabXac", "XabXac"));

    // Try to trick algo with an earlier false match overlapping with the real match
    //  - match first "aba" and miss that the last "a" is the start of the
    // real match
    assertEquals("ababa--", overlapOnce("ababa", "aba--"));
}