Find all concatenations of two strings in a large set (Java)


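Given a set of 50k strings, I need to find all pairs (s, t) such that s, t, and s + t are all contained in this set.

What I've tried
There is an additional constraint: s.length() >= 4 && t.length() >= 4. This makes it possible to group the strings by their length-4 prefixes and, separately, by their length-4 suffixes. Then, for every string composed of length at least 8, I look up the candidates for s using the first four characters of composed, and the candidates for t using its last four characters. This works, but it has to look at 30M candidate pairs (s, t) to find the 7k results.
The large number of candidates comes from the fact that the strings (mostly German) come from a limited vocabulary, and words often share the same beginnings and endings. It is still much better than trying all 2.5G pairs, but much worse than I had hoped.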
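The grouping approach described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the class name PrefixSuffixGrouping, the method findPairs, and the "s+t" result format are assumptions made for the example. The innermost comparison is deliberately the simplest possible one, since it stands for the per-candidate cost that is paid 30M times.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PrefixSuffixGrouping {
    static final int K = 4; // minimum length of s and of t, as in the question

    /** Returns every match as "s+t" where s, t and s + t are all in words. */
    static List<String> findPairs(Set<String> words) {
        // Group all eligible strings by their first K and, separately, last K characters.
        Map<String, List<String>> byPrefix = new HashMap<>();
        Map<String, List<String>> bySuffix = new HashMap<>();
        for (String w : words) {
            if (w.length() < K) continue;
            byPrefix.computeIfAbsent(w.substring(0, K), k -> new ArrayList<>()).add(w);
            bySuffix.computeIfAbsent(w.substring(w.length() - K), k -> new ArrayList<>()).add(w);
        }
        List<String> result = new ArrayList<>();
        for (String composed : words) {
            if (composed.length() < 2 * K) continue; // too short to be s + t
            List<String> sCandidates =
                byPrefix.getOrDefault(composed.substring(0, K), Collections.emptyList());
            List<String> tCandidates =
                bySuffix.getOrDefault(composed.substring(composed.length() - K), Collections.emptyList());
            // Every (s, t) combination below is one "candidate pair" in the question's sense.
            for (String s : sCandidates) {
                for (String t : tCandidates) {
                    if (composed.equals(s + t)) {
                        result.add(s + "+" + t);
                    }
                }
            }
        }
        return result;
    }
}
```

The nested loop over sCandidates and tCandidates is where the candidate count explodes when many words share the same four leading or trailing characters.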
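What I need
Since the additional constraint may be dropped and the set will grow, I'm looking for a better algorithm.
The "missing" question
There were complaints that I hadn't asked a question, so the missing question mark is at the end of the next sentence. How can this be done more efficiently, ideally without relying on the constraint?

One possible solution could be this.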
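Take the first string as a prefix and the second string as a suffix.
Go through every string. If the string starts with the first string, check whether it ends with the second string, and keep going until the end of the set. To save time before comparing the characters themselves, you can do a length check first.
This is more or less what you are already doing, but the extra length check may let you prune some candidates. That's my take on it, at least.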
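As a minimal sketch of that check (the class and method names here are made up for illustration): comparing lengths first rejects most candidates without touching any characters, and once the lengths add up, matching the prefix and the suffix is enough to prove that composed equals s + t.

```java
class ConcatCheck {
    /** True if composed is exactly s followed by t. */
    static boolean isConcatenation(String composed, String s, String t) {
        // Cheap rejection: no character comparison needed if the lengths do not add up.
        if (s.length() + t.length() != composed.length()) {
            return false;
        }
        // With matching lengths, prefix and suffix cannot overlap or leave a gap.
        return composed.startsWith(s) && composed.endsWith(t);
    }
}
```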
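I'm not sure this is better than your solution, but I think it's worth a try.
Build two Tries, one holding the candidates in normal order and the other holding the words reversed.
Walk the forward Trie from depth 4 inwards and use the remainder of the leaf to determine the suffix (or something along those lines), then look it up in the backward Trie.

I have posted a Trie implementation here in the past.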
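One way to read the two-Trie idea, sketched under my own assumptions (this is not the Trie implementation mentioned above; Node, TwoTries and all method names are hypothetical): walking a word through the forward Trie yields every prefix length that is itself a word, walking it backwards through the reversed Trie yields every suffix length that is a word, and a pair exists wherever the two lengths add up to the word's length.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoTries {
    /** Minimal trie node (hypothetical helper for this sketch). */
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    /** Inserts word front to back, or back to front when reversed is true. */
    static void insert(Node root, String word, boolean reversed) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(reversed ? word.length() - 1 - i : i);
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** marks[i] is true when the first (or, if reversed, last) i characters of text form a stored word. */
    static boolean[] wordLengths(Node root, String text, boolean reversed) {
        boolean[] marks = new boolean[text.length() + 1];
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(reversed ? text.length() - 1 - i : i);
            node = node.children.get(c);
            if (node == null) break; // no stored word continues this way
            marks[i + 1] = node.isWord;
        }
        return marks;
    }

    /** Returns every match as "s+t"; minLength is the minimum length of s and of t. */
    static List<String> findPairs(List<String> words, int minLength) {
        Node forward = new Node();
        Node backward = new Node();
        for (String w : words) {
            insert(forward, w, false);
            insert(backward, w, true);
        }
        List<String> result = new ArrayList<>();
        for (String w : words) {
            boolean[] prefixes = wordLengths(forward, w, false);
            boolean[] suffixes = wordLengths(backward, w, true);
            for (int split = minLength; split <= w.length() - minLength; split++) {
                if (prefixes[split] && suffixes[w.length() - split]) {
                    result.add(w.substring(0, split) + "+" + w.substring(split));
                }
            }
        }
        return result;
    }
}
```

Each word costs two walks proportional to its length, regardless of how many other words share its beginning or ending, and nothing in the sketch depends on the length-4 constraint.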
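Algorithm 1: Test pairs, not singles
One approach: instead of working from all possible pairs towards the composite strings that contain them, start from all possible composite strings and check whether they contain a pair. This changes the problem from n^2 lookups (where n is the number of strings of at least 4 characters) to m * n lookups (where m is the average length of all strings of at least 8 characters, minus 7, and n is now the number of strings of at least 8 characters). Here is an implementation: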
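import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;
import java.util.Set;
import java.util.stream.IntStream;
import java.util.stream.Stream;

// Requires Guava. The statements below belong inside a method body.
int minWordLength = 4; // minimum length of s and of t
int minPairLength = 8; // minimum length of s + t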

Set<String> strings = Stream
   .of(
      "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
      "bear", "hug", "bearhug", "cur", "curlique", "curl",
      "down", "downstream", "stream"
   )
   .filter(s -> s.length() >= minWordLength)
   .collect(ImmutableSet.toImmutableSet());

strings
   .stream()
   .filter(s -> s.length() >= minPairLength)
   .flatMap(s -> IntStream
      .rangeClosed(minWordLength, s.length() - minWordLength)
      .mapToObj(splitIndex -> ImmutableList.of(
         s.substring(0, splitIndex),
         s.substring(splitIndex)
      ))
      .filter(pair ->
          strings.contains(pair.get(0))
          && strings.contains(pair.get(1))
      )
   )
   .map(pair ->
      pair.get(0) + pair.get(1) + " = " + pair.get(0) + " + " + pair.get(1)
   )
   .forEach(System.out::println);

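Its average algorithmic complexity is m * n, as shown above, so in effect O(n). In the worst case it is O(n^2), because hash lookups that are O(1) on average can degrade to O(n) each.
Explanation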
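Put all strings that are four or more characters long into a hash set (search takes O(1) on average). I used Guava's ImmutableSet for convenience; use whatever you like.
filter: restrict to the items that are eight or more characters long; these are our candidates for being a combination of two other words in the list.
flatMap: for each candidate, compute all possible pairs of sub-words, making sure each sub-word is at least 4 characters long. Since there can be more than one result, this is in effect a list of lists, so flatten it into a single-level list.
  rangeClosed: generate all integers representing the number of characters in the first word of the pair being checked.
  mapToObj: combine each integer with the candidate string to output a list of two items (in production code you would probably want something clearer, such as a two-property value class or an appropriate existing class).
  filter: restrict to pairs where both halves are in the list.
map: tidy up the result a little.
forEach: output to the console.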
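Since the question mentions that the length constraint may be dropped, here is a sketch of the same split-and-probe idea using only the JDK, with the minimum length passed in as a parameter. The class SplitAndProbe and its findPairs method are names invented for this example, and minWordLength is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SplitAndProbe {
    /** Returns each match as a two-element array {s, t}; assumes minWordLength >= 1. */
    static List<String[]> findPairs(Collection<String> input, int minWordLength) {
        // Only strings of at least minWordLength characters can play the role of s or t.
        Set<String> words = new HashSet<>();
        for (String s : input) {
            if (s.length() >= minWordLength) {
                words.add(s);
            }
        }
        List<String[]> pairs = new ArrayList<>();
        for (String composed : words) {
            // Try every split that leaves both halves at least minWordLength long.
            // Strings shorter than 2 * minWordLength produce no iterations at all.
            for (int split = minWordLength; split <= composed.length() - minWordLength; split++) {
                String s = composed.substring(0, split);
                String t = composed.substring(split);
                if (words.contains(s) && words.contains(t)) {
                    pairs.add(new String[] {s, t});
                }
            }
        }
        return pairs;
    }
}
```

Calling findPairs(words, 1) removes the constraint entirely; the number of probes then grows with the total number of characters in the set rather than with the number of pairs.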
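Algorithm choice
This algorithm is tuned for words that are far shorter than the number of items in the list. If the list were very short and the words very long, switching back to a composition task instead of a decomposition task would work better. Given a list of 50,000 strings, and German words that are unlikely to exceed 50 characters, that is a factor of 1:1000 in favor of this algorithm.
If, on the other hand, you had 50 strings averaging 50,000 characters in length, a different algorithm would be far more efficient.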
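To make the trade-off concrete, the two costs can be estimated directly from the data. With 50,000 strings, composing means up to 50,000^2 = 2.5 billion pairs, while decomposing would need at most 50,000 * 43 = 2.15 million split checks if no word exceeded 50 characters. The helper below is a rough sketch with made-up names (LookupCostEstimate, compositionPairs, decompositionSplits); it counts work items only and ignores the cost of each individual check.

```java
import java.util.Collection;

class LookupCostEstimate {
    /** Composition: every eligible string paired with every eligible string. */
    static long compositionPairs(Collection<String> words, int minWordLength) {
        long n = words.stream().filter(w -> w.length() >= minWordLength).count();
        return n * n;
    }

    /** Decomposition: a string of length L offers L - 2k + 1 splits (L - 7 for k = 4), never fewer than 0. */
    static long decompositionSplits(Collection<String> words, int minWordLength) {
        return words.stream()
                .mapToLong(w -> Math.max(0, w.length() - 2 * minWordLength + 1))
                .sum();
    }
}
```

Whichever count is smaller for the data at hand suggests which direction to work in.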
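Algorithm 2: Sort and keep a candidate list
One algorithm I considered for a while was to sort the list, knowing that if a string is the start of a pair, every composite string it could begin will come right after it in sort order, within the run of items that start with that string. Sorting my tricky data above, with a few confounders added (downer, downs, downregulate), gives the sorted list reproduced after the code.
So if we keep a running set of all items still to be checked, we can find candidate composites in essentially constant time per word, and then probe the hash table directly for the remainder: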
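import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

// Requires Guava. The statements below belong inside a method body.
int minWordLength = 4; // minimum length of s and of t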

Set<String> strings = Stream
   .of(
      "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
      "bear", "hug", "bearhug", "cur", "curlique", "curl",
      "down", "downs", "downer", "downregulate", "downstream", "stream")
   .filter(s -> s.length() >= minWordLength)
   .collect(ImmutableSet.toImmutableSet());

ImmutableList<String> orderedList = strings
   .stream()
   .sorted()
   .collect(ImmutableList.toImmutableList());
List<String> candidates = new ArrayList<>();
List<Map.Entry<String, String>> pairs = new ArrayList<>();

for (String currentString : orderedList) {
   List<String> nextCandidates = new ArrayList<>();
   nextCandidates.add(currentString);
   for (String candidate : candidates) {
      if (currentString.startsWith(candidate)) {
         nextCandidates.add(candidate);
         String remainder = currentString.substring(candidate.length());
         if (remainder.length() >= minWordLength && strings.contains(remainder)) {
            pairs.add(new AbstractMap.SimpleEntry<>(candidate, remainder));
         }
      }
   }
   candidates = nextCandidates;
}
pairs.forEach(System.out::println);

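The algorithmic complexity of this one is a little more involved. I think the searching part is O(n) on average and O(n^2) in the worst case. The most expensive part may well be the sorting, which depends on the algorithm used and on the characteristics of the unsorted data.
The sorted data, with the confounders included: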
a
abc
abcdef
bear
bearhug
cur
curl
curlique
def
down ---------\
downer        |
downregulate  | not far away now!
downs         |
downstream ---/
hug
shine
stream
sun
sunshine

Output of the Algorithm 2 code:
down=stream

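A variant of Algorithm 1 that wraps each string in a CharBuffer, so that both halves of a split are looked up by moving the buffer's position and limit instead of allocating substrings:

import java.nio.CharBuffer;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

Set<CharBuffer> strings = Stream.of(
    "a", "abc", "abcdef", "def", "sun", "sunshine", "shine",
    "bear", "hug", "bearhug", "cur", "curlique", "curl",
    "down", "downstream", "stream"
 )
.filter(s -> s.length() >= 4) // < 4 is irrelevant
.map(CharBuffer::wrap)
.collect(Collectors.toSet());

strings
    .stream()
    .filter(s -> s.length() >= 8)  // only these can be a concatenation
    .map(CharBuffer::wrap)         // fresh view, so the set's own elements are never modified
    .flatMap(cb -> IntStream.rangeClosed(4, cb.length() - 4)
        // clear().position(i) views the suffix [i, end); flip() then views the prefix [0, i)
        .filter(i -> strings.contains(cb.clear().position(i)) && strings.contains(cb.flip()))
        // explicit toString() captures each view before the buffer is moved again
        .mapToObj(i -> cb.clear().toString() + " = " + cb.limit(i).toString()
            + " + " + cb.clear().position(i).toString())
    )
    .forEach(System.out::println);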
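The CharBuffer trick relies on the fact that a CharBuffer's equals and hashCode consider only the characters between its position and its limit. A small stand-alone sketch of a single split check, with the hypothetical names CharBufferViews and splitsAt:

```java
import java.nio.CharBuffer;
import java.util.Set;

class CharBufferViews {
    /** True if composed, cut at index split, yields two halves that are both in words. */
    static boolean splitsAt(Set<CharBuffer> words, String composed, int split) {
        CharBuffer view = CharBuffer.wrap(composed); // position 0, limit = length
        view.position(split);                        // view now covers [split, end)
        boolean suffixKnown = words.contains(view);
        view.flip();                                 // limit = split, position = 0: covers [0, split)
        boolean prefixKnown = words.contains(view);
        return prefixKnown && suffixKnown;
    }
}
```

The buffers stored in the set must not have their position or limit changed after insertion, since that would change their hash codes; this is why the lookups use a separate wrapped view.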