How to break down a given text into words from a dictionary?

string algorithm language-agnostic data-structures

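This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break text down into substrings such that each substring is found in the dictionary?

For example, using /usr/share/dict/words you can break "thisisatext" into ["this", "is", "a", "text"].
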
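I believe backtracking can solve this problem (in pseudo-Java):
void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}
List<String> solution = new List<String>()
solve(text, dict, solution)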
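Does this make sense? Would you optimize the step of searching for the prefixes in the dictionary? What data structures would you recommend?

Approach 1 - A trie looks like a good fit here. Build a trie from the words in the English dictionary. Building the trie is a one-time cost. Once it is built, your string can easily be compared against it letter by letter. If at any point you reach a leaf in the trie (or any node that marks the end of a word), you can assume you have found a word; add it to a list and carry on with the traversal. Keep traversing until you reach the end of your string. The list is the output.
Time complexity of a search - O(word_length)
Space complexity - O(charsize * word_length * no_words), i.e. the size of your dictionary.
Approach 2 - Something I have heard of but never used; it might be useful here.
Approach 3 - More pedantic and a lousy alternative. You have already suggested this.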
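
Approach 1 can be sketched in Python roughly as follows. This is only an illustration under assumptions: the trie is a nested dict, and build_trie, greedy_split and the END marker are hypothetical names, not part of the answer. The scan takes the first word it meets and restarts from the trie root, as the answer describes.

```python
END = "<end>"  # multi-character key, so it can never clash with a single letter


def build_trie(words):
    """Build a nested-dict trie; the END key marks a node that ends a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def greedy_split(text, root):
    """Scan text once, emitting a word each time the trie path ends a word.

    Returns the list of words, or None if the scan gets stuck.
    """
    words, node, start = [], root, 0
    for i, ch in enumerate(text):
        if ch not in node:
            return None  # no dictionary word continues with this letter
        node = node[ch]
        if END in node:
            words.append(text[start:i + 1])
            node, start = root, i + 1  # restart from the trie root
    # The split only succeeds if the text ended exactly on a word boundary.
    return words if node is root else None


trie = build_trie(["this", "is", "a", "text"])
print(greedy_split("thisisatext", trie))
```

Taking the first word found is not always enough: with the words "a", "ab" and "c", the text "abc" gets stuck after "a" even though "ab" + "c" is a valid split. That is why the other answers add backtracking or dynamic programming.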
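You could try it the other way round. Run through the dict and check for a substring match. Here I am assuming the keys in dict are the words of the English dictionary /usr/share/dict/words. So the pseudo code looks something like this -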
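def splitIntoWords(text, d):
    # Collect every dictionary word that occurs anywhere in text.
    # Note: the result is a list of candidate words, not an ordered split.
    words = []
    for word in d:
        if word in text:
            words.append(word)
    return words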

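Complexity - O(n) to run through the whole dict + O(1) for each substring match (in practice a substring search takes time proportional to the length of the text rather than O(1))
Space - worst case O(n), if len(words) == len(dict)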

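As others have pointed out, this does require backtracking.
There is a very thorough write-up of this problem in this post.
The basic idea is that by simply memoizing the function you wrote, you get an O(n^2) time, O(n) space algorithm.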
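
A minimal Python sketch of that idea, assuming the dictionary fits in an in-memory set. The names segment and solve are illustrative only. Results are cached per start index, so each suffix of the text is solved at most once.

```python
from functools import lru_cache


def segment(text, dictionary):
    """Return one list of dictionary words that concatenate to text, or None."""
    words = frozenset(dictionary)

    @lru_cache(maxsize=None)
    def solve(start):
        # solve(start) segments the suffix text[start:]; caching by start
        # is the memoization step that removes the repeated work.
        if start == len(text):
            return ()
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[start:end],) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


print(segment("thisisatext", {"this", "is", "a", "text"}))
```

There are n possible start positions and at most n prefixes to try from each, which is where the O(n^2) figure comes from if each dictionary lookup is counted as constant time. This sketch caches whole word tuples for readability; caching only the index of the next split would stay closer to the O(n) space mentioned above.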
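This solution assumes the existence of a Trie data structure for the dictionary. Further, for each node in the Trie, the following functions are assumed:
node.IsWord(): returns true if the path to that node is a word
node.IsChild(char x): returns true if there exists a child with label x
node.GetChild(char x): returns the child node with label x
def annotate(str, start, end, root, node):
    # Walk the trie from node along str[start..end]; whenever the path so far
    # spells a word, record where that word started.
    i = start
    while i <= end:
        if node.IsChild(str[i]):
            node = node.GetChild(str[i])
            if node.IsWord():
                root[i + 1] = start
            i += 1
        else:
            break

end = len(str) - 1
root = [-1 for i in range(len(str) + 1)]
for start in range(0, end + 1):
    # Only start a word at position 0 or where an earlier word ended.
    if start == 0 or root[start] >= 0:
        annotate(str, start, end, root, trieRoot)

index  0  1  2  3  4  5  6  7  8  9  10  11
str:   t  h  i  s  i  s  a  t  e  x  t
root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7

I will leave it to you to list the words that make up the string by traversing root in reverse.
The time complexity is O(nk), where n is the length of the string and k is the length of the longest word in the dictionary.
PS: I am assuming the following words in the dictionary: this, is, a, text, ate.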
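
For readers who want to run this, here is one possible Python sketch, including the reverse traversal left as an exercise. The TrieNode class, build_trie, annotate_all and words_from_root are assumptions made for illustration; the answer only specifies the IsWord/IsChild/GetChild interface.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if the path to this node spells a word


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def annotate_all(text, trie_root):
    """root[j] = start index of a word ending just before j, or -1."""
    root = [-1] * (len(text) + 1)
    for start in range(len(text)):
        if start != 0 and root[start] < 0:
            continue  # no word ends here, so no word may start here
        node = trie_root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                root[i + 1] = start
    return root


def words_from_root(text, root):
    """Walk root backwards from the end to list the words, or None."""
    words, end = [], len(text)
    while end > 0:
        start = root[end]
        if start < 0:
            return None  # the end of the text is not reachable
        words.append(text[start:end])
        end = start
    return words[::-1]


trie = build_trie(["this", "is", "a", "text", "ate"])
root = annotate_all("thisisatext", trie)
print(root)
print(words_from_root("thisisatext", root))
```

With the five words from the PS, the printed root array matches the table above, and the backward walk yields this, is, a, text.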
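You can solve this problem using dynamic programming and hashing.
Calculate the hash of every word in the dictionary. Use the hash function you like the most. I would use something like (a1 * B^(n-1) + a2 * B^(n-2) + ... + an * B^0) % P, where a1a2...an is a string, n is the length of the string, B is the base of the polynomial and P is a large prime number. If you have the hash value of a string a1a2...an, you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.
The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.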
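
As a small illustration of the hash described above, the following Python sketch computes the polynomial hash and extends it by one character in constant time. The values chosen for B and P and the names poly_hash and extend_hash are arbitrary assumptions, not taken from the answer.

```python
B = 257            # base of the polynomial (an arbitrary choice)
P = 1_000_000_007  # a large prime modulus (also an arbitrary choice)


def poly_hash(s):
    """(a1*B^(n-1) + a2*B^(n-2) + ... + an*B^0) % P, via Horner's rule."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % P
    return h


def extend_hash(h, ch):
    """Hash of s + ch given h = poly_hash(s), computed in constant time."""
    return (h * B + ord(ch)) % P


dictionary = ["this", "is", "a", "text"]
hashes = {poly_hash(w) for w in dictionary}

# Grow the hash one character at a time, as the DP loop would.
h = 0
for ch in "this":
    h = extend_hash(h, ch)
print(h == poly_hash("this"), h in hashes)
```

Two different strings can still share a hash value, so a careful implementation would compare the actual substring against the stored word whenever the hashes match.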
Then, use a DP function like this:
   bool vis[LENGTH_OF_STRING];
   bool go(char str[], int length, int position)
   {
      int i;

      // You found a set of words that can solve your task.
      if (position == length) {
          return true;
      }

      // You already have visited this position. You haven't had luck before, and obviously you won't have luck this time.
      if (vis[position]) {
         return false;
      }
      // Mark this position as visited.
      vis[position] = true;

      // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary.
      for (i = position; i < length; i++) {
         // Calculate the hash value of the substring str(position, i);
         if (hashValue is in dict) {
            // Check whether the substring str(i + 1, length) can be partitioned into a set of words in the dictionary.
            if (go(str, length, i + 1)) {
               // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length).
               return true;
            }
         }
      }

      return false;
   }

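The complexity of this algorithm is O(N * M), where N is the length of the string and M is the length of the longest word in the dictionary, or O(N^2), depending on whether you coded the improvement or not.
So the total complexity of the algorithm is O(N1 * M) + O(N2 * M) (or O(N2^2)), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary, and N2 is the length of the string.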
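
The effect of the suggested improvement can be seen in a bottom-up variant of the same DP. This is a sketch under assumptions rather than the answer's code: it uses Python's built-in set, which hashes the strings itself, in place of the hand-rolled hash table, and can_segment is a hypothetical name.

```python
def can_segment(text, dictionary):
    """Return True if text can be split into words from dictionary."""
    words = set(dictionary)
    longest = max((len(w) for w in words), default=0)
    n = len(text)
    # reachable[i] is True when text[i:] can be split into dictionary words.
    reachable = [False] * (n + 1)
    reachable[n] = True
    for position in range(n - 1, -1, -1):
        # Never look at substrings longer than the longest dictionary word.
        for end in range(position + 1, min(n, position + longest) + 1):
            if reachable[end] and text[position:end] in words:
                reachable[position] = True
                break
    return reachable[0]


print(can_segment("thisisatext", ["this", "is", "a", "text"]))
```

Each position tries at most M substrings, which gives the O(N * M) number of lookups. Slicing and hashing each substring costs extra time in this sketch; the incremental hash from the answer is what removes that cost.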
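If you can't think of a good hash function (one without any collisions), another possible solution is to use Tries or a Patricia trie (if the size of the normal trie is very large) (I can't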