Unable to combine English words from letters with Ruby

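I need to find all English words that can be formed from the letters in a string

 sentence="Ziegler's Giant Bar"
I can make an array of letters by

 sentence.split(//)  
How can I make more than 4500 English words from the sentence in Ruby?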

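[edit]

It may be best to split the problem into parts:

  • build an array only of words that have 10 letters or fewer
  • the longer words can be looked up separately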

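    I don't think Ruby has an English dictionary. But you could try storing all the permutations of the original string in an array and checking those strings against Google? Say a string counts as an actual word if it gets more than 100,000 hits or so?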
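To make the permutation idea concrete, here is a minimal sketch that swaps the Google lookup for a small in-memory word list. `KNOWN_WORDS` and `words_from_permutations` are hypothetical names introduced for illustration, not part of the original suggestion. It also hints at why the approach scales badly: the 16 letters of the example sentence have 16! (about 2 × 10^13) full-length orderings, so this is only practical for short strings.

```ruby
require 'set'

# Hypothetical stand-in for a real dictionary or search-engine lookup.
KNOWN_WORDS = Set.new(%w[a at tab bat rat art tar bar bra brat barb zebra])

# Try every ordering of every selection of letters, up to max_len letters,
# and keep the candidates that appear in the word list.
def words_from_permutations(text, max_len)
  letters = text.downcase.scan(/[a-z]/)
  found = Set.new
  (1..[max_len, letters.size].min).each do |len|
    letters.permutation(len) do |perm|
      candidate = perm.join
      found << candidate if KNOWN_WORDS.include?(candidate)
    end
  end
  found.to_a.sort
end

p words_from_permutations("Bart", 4)
# => ["a", "art", "at", "bar", "bat", "bra", "brat", "rat", "tab", "tar"]
```

"barb" and "zebra" are rejected because "Bart" supplies only one "b" and no "z" or "e".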

    You can get an array of letters like so:

    sentence = "Ziegler's Giant Bar"
    letters = sentence.split(//)
    

    If you want to find words whose letters and letter frequencies are restricted by the given phrase, you can construct a regex to do it:

    sentence = "Ziegler's Giant Bar"
    
    # count how many times each letter occurs in the 
    # sentence (ignoring case, and removing non-letters)
    counts = Hash.new(0)
    sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
      counts[letter] += 1
    end
    letters = counts.keys.join
    length = counts.values.inject { |a,b| a + b }
    
    # construct a regex that matches up to that many occurrences
    # of only those letters, ignoring non-letters
    # (in a positive lookahead)
    length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
    # construct regexes that match each letter up to its
    # proper frequency (in a positive lookahead)
    count_regexes = counts.map do |letter, count|
      /(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
    end
    
    # combine the regexes, to form a regex that will only
    # match words that are made of a subset of the letters in the string
    regex = /#{length_regex}#{count_regexes.join('')}/
    
    # open a big file of words, and find all the ones that match
    words = File.open("/usr/share/dict/words") do |f|
      f.map { |word| word.chomp }.find_all { |word| regex =~ word }
    end
    
    words.length #=> 3182
    words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
              "Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
              ...
              "ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
              "Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
              ...
              "eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
              "earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
              ...
              "gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
              "gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
              ...
              "Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
              "Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
              ...
              "laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
              "laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
              ...
              "Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
              "nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
              ...
              "Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
              "rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
              ...
              "sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
              "saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
              ...
              "tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
              "tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
              ...
              "zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
              "zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]
    
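    Positive lookaheads let you write a regex that matches a position in the string where some specified pattern matches, without consuming the part of the string that matches. Here we use them to match the same string against multiple patterns in a single regex. The position only matches if all of the patterns match.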
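As a small illustration of stacking lookaheads, the sketch below combines three independent conditions into one regex and runs it over a hand-picked word list. The pattern names and the candidate words are made up for this example; the construction mirrors the per-letter regexes above.

```ruby
# Each lookahead inspects the whole string from the start without consuming it,
# so several independent conditions can be stacked in a single regex.
only_a_and_b  = /(?=^[ab]+$)/i
at_most_one_a = /(?=^(?:[^a]*a){0,1}[^a]*$)/i
at_most_two_b = /(?=^(?:[^b]*b){0,2}[^b]*$)/i

# Interpolating a Regexp keeps its own flags, so /i still applies to each part.
combined = /#{only_a_and_b}#{at_most_one_a}#{at_most_two_b}/

candidates = %w[ab bab abba baa c Bb]
matches = candidates.select { |word| combined =~ word }
p matches
# => ["ab", "bab", "Bb"]
```

"abba" and "baa" fail the limit of one "a", and "c" fails the allowed-letters check.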

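    If we allow letters from the original phrase to be reused an unlimited number of times (as Knuth did, according to a comment), then constructing the regex is even easier: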

    sentence = "Ziegler's Giant Bar"
    
    # find all the letters in the sentence
    letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq
    
    # construct a regex that matches any line in which
    # the only letters used are the ones in the sentence
    regex = /^([^a-z]|[#{letters.join}])*$/i
    
    # open a big file of words, and find all the ones that match
    words = File.open("/usr/share/dict/words") do |f|
      f.map { |word| word.chomp }.find_all { |word| regex =~ word }
    end
    
    words.length #=> 6725
    words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
               ...
               "azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
               "Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
               ...
               "Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
               "brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
               "eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
               ...
               "eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
               "Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
               "gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
               ...
               "grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
               "I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
               ...
               "itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
               "L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
               ...
               "litter", "litterer", "little", "littleness", "littling", "littress", "litz",
               "Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
               "Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
               ...
               "niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
               "nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
               "rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
               ...
               "riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
               "s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
               ...
               "strigine", "string", "stringene", "stringent", "stringentness", "stringer",
               "stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
               "Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
               ...
               "tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
               "za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
               ...
               "Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]
    

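    [Assuming that source letters can be reused within one word]: For each word in the dictionary list, construct two arrays of letters, one for the candidate word and one for the input string. Subtract the input array of letters from the word's array of letters; if no letters are left over, you have a match. Code to do that looks like this: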

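    def findWordsWithReplacement(sentence)
        out=[]
        splitArray=sentence.downcase.split(//)
        # read the file line by line (String#each no longer exists) and use
        # strip, since strip! returns nil when there is nothing to strip
        File.foreach("/usr/share/dict/words"){|line|
            word=line.strip
            # array difference drops every letter that occurs in the input
            if (word.downcase.split(//) - splitArray).empty?
                out.push word
            end
         }
         return out
    end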
    
    You can call that function from irb like so:

    output=findWordsWithReplacement("some input string"); puts output.join(" ")
    
    …or here is a wrapper you can use to call the function interactively from a script:

    puts "enter the text."
    ARGF.each {|line|
        puts "working..."
        out=findWordsWithReplacement(line)
        puts out.join(" ")
        puts "there were #{out.size} words."
    }
    
    When run on a Mac, the output looks like this:

    $ ./findwords.rb
    enter the text.
    Ziegler's Giant Bar
    working...
    A a aa aal aalii Aani […] aba […] abaiser […] abaser Abasgi abasia Abassin […] Abbie Abe abear Abel abele Abelia […]
    […]
    Z z […] Zanzalian zanze Zanzibari zar zaratite […] Zebrina zebrine zee zein zeist zel Zelanian Zeltinger Zenaga zenana zer zeta ziara ziarat zibeline zibet ziega zieger […] zigzagger Zilla […] Zirbanit Zirian Zirianian Zizania Zizia zizz
    there were 6725 words.

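    That is a lot more than 4500 words, but that is because the Mac word dictionary is fairly large. If you want to reproduce Knuth's results exactly, download and unpack Knuth's dictionary from here: […] and replace "/usr/share/dict/words" with the path to wherever you unpacked the substitute dictionary. If you did it right, you will get 4514 words, ending in this series: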

    […] zeal […] Zeiss zeitgeist Zen Zennist zestier zeta Ziegler zig […]

    I believe that answers the original question.
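Since the result depends so heavily on which word list is used, it may help to pass the dictionary path in rather than hard-coding it. The following is only a sketch of that idea: `find_words_in` and its `dict_path` parameter are hypothetical names, and the demonstration uses a tiny throwaway file instead of a real dictionary.

```ruby
require 'tempfile'

# dict_path is a hypothetical parameter; it defaults to the system word list.
def find_words_in(sentence, dict_path = "/usr/share/dict/words")
  allowed = sentence.downcase.split(//)
  File.foreach(dict_path).map(&:strip).select do |word|
    # letters may be reused: the word only has to avoid letters not in the input
    (word.downcase.split(//) - allowed).empty?
  end
end

# Demonstration with a temporary four-word list.
Tempfile.create("words") do |f|
  f.puts %w[zebra zing quiz Zen]
  f.flush
  p find_words_in("Ziegler's Giant Bar", f.path)
  # => ["zebra", "zing", "Zen"]
end
```

Swapping in another dictionary then only means passing a different path as the second argument.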

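    Alternatively, the questioner or reader may have wanted to list all the words one can construct from a string without reusing any of the input letters. The code I suggest for that works as follows: copy the candidate word, then for each letter in the input string, destructively remove the first instance of that letter from the copy (using "slice!"). If this process absorbs all the letters, accept the word.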

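    def findWordsNoReplacement(sentence)
        out=[]
        splitInput=sentence.downcase.split(//)
        # read line by line (String#each no longer exists) and use strip,
        # since strip! returns nil when there is nothing to strip
        File.foreach("/usr/share/dict/words"){|line|
            word=line.strip
            next if word.empty?
            copy=word.downcase
            # remove one occurrence of each input letter from the copy
            splitInput.each {|o| copy.slice!(o) }
            out.push word if copy==""
         }
         return out
    end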
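The same no-reuse check can also be written with letter counts instead of destructive slicing. This is a different technique from the `slice!` approach, sketched here with hypothetical helper names (`letter_counts`, `fits?`) and a small hand-picked candidate list rather than a dictionary file. Unlike the code above, it ignores non-letter characters in both the input and the candidate word.

```ruby
# Count how often each letter a-z occurs in a string.
def letter_counts(text)
  counts = Hash.new(0)
  text.downcase.scan(/[a-z]/) { |letter| counts[letter] += 1 }
  counts
end

# A word fits if it needs no letter more often than the input supplies it.
def fits?(word, available)
  letter_counts(word).all? { |letter, needed| needed <= available[letter] }
end

available = letter_counts("Ziegler's Giant Bar")
candidates = %w[zebra giant eagle gabble letter]
p candidates.select { |word| fits?(word, available) }
# => ["zebra", "giant", "eagle"]
```

"gabble" is rejected because the sentence has only one "b", and "letter" because it has only one "t".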
    
