Detecting syllables in a word (NLP, spell checking, hyphenation)

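I need to find a fairly efficient way to detect syllables in a word. For example:

Invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

*where V is a vowel and C is a consonant. For example:

Pronunciation (5 syllables: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)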
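To make the V/C notation concrete, here is a minimal Python sketch that maps an already hyphenated word to its letter-by-letter pattern. The function name `cv_pattern` and the choice to treat 'y' as a vowel are assumptions made for this illustration. It works on letters rather than sounds, so consonant clusters and vowel pairs show up as CC or VV.

```python
VOWELS = set("aeiouy")  # assumption: 'y' is treated as a vowel


def cv_pattern(hyphenated):
    """Map each letter of a hyphen-separated word to 'V' or 'C'."""
    parts = []
    for syllable in hyphenated.lower().split("-"):
        parts.append("".join("V" if ch in VOWELS else "C" for ch in syllable))
    return "-".join(parts)


print(cv_pattern("in-vi-sib-le"))  # VC-CV-CVC-CV
```

Every syllable shape produced for this word (VC, CV, CVC) appears in the rule list above.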

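I have tried a few methods, among them regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite-state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.

I would appreciate any tips on an alternative way to solve this problem besides my previous approaches.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.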

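Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis "Word Hy-phen-a-tion by Com-put-er". His algorithm is very accurate, and it includes a small exceptions dictionary for the cases where the algorithm does not work.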
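As a rough illustration of the pattern-plus-exceptions idea, the Python sketch below applies Liang-style patterns: digits inside a pattern score the gaps between letters, the highest score at each gap wins, and odd scores allow a break. The nine patterns are a tiny hand-picked set that happens to cover one word, and the exceptions entry is hypothetical; neither is the real TeX data. TeX's minimum fragment lengths are also ignored here.

```python
import re

# Illustrative patterns only: a tiny hand-picked set, not the real TeX pattern file.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
# Hypothetical exceptions dictionary: breaks stored verbatim for listed words.
EXCEPTIONS = {"table": ["ta", "ble"]}


def parse_pattern(pattern):
    """Split 'hy3ph' into its letters 'hyph' and gap scores [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)  # one slot per gap, including both ends
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)
        else:
            pos += 1
    return letters, scores


PARSED = [parse_pattern(p) for p in PATTERNS]


def hyphenate(word):
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    dotted = "." + lower + "."  # dots mark the word boundaries
    points = [0] * (len(dotted) + 1)
    for letters, scores in PARSED:
        start = dotted.find(letters)
        while start != -1:
            # Keep the highest score seen for every gap the pattern covers.
            for offset, score in enumerate(scores):
                points[start + offset] = max(points[start + offset], score)
            start = dotted.find(letters, start + 1)
    pieces, current = [], ""
    for i, ch in enumerate(word):
        # points[i + 1] is the score of the gap just before word[i].
        if i > 0 and points[i + 1] % 2 == 1:
            pieces.append(current)
            current = ""
        current += ch
    pieces.append(current)
    return pieces


print(hyphenate("hyphenation"))  # ['hy', 'phen', 'ation']
```

Note that the result is a set of permitted line-break points, which is not always the same thing as a full syllable split.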

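Perl has a module for this. You could try it, or look into its algorithm. I saw a few other, older modules there too.

I don't understand why a regular expression would give you only a count of syllables. You should be able to get the syllables themselves using capturing parentheses. Assuming you can construct a regular expression that works, that is.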
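A minimal Python sketch of that idea follows. The pattern is a naive assumption written for this example, not a validated syllabifier: each match is an optional consonant run, a vowel run, and at most one trailing consonant when another consonant follows (or all trailing consonants at the end of the word). `re.findall` returns the matched chunks themselves, so no separate counting step is needed.

```python
import re

# Naive, letter-based pattern (an assumption for this sketch, 'y' counted as a vowel).
SYLLABLE = re.compile(
    r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?"
)


def split_syllables(word):
    """Return the syllable-like chunks matched by the pattern."""
    return SYLLABLE.findall(word.lower())


print(split_syllables("invisible"))  # ['in', 'vi', 'sib', 'le']
```

Vowel pairs are never split by this pattern, so a word like "pronunciation" comes out one chunk short; a letter-only regex has no way to know which vowel pairs form one sound.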

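I stumbled upon this page while looking for the same thing, and found a few implementations of the Liang paper here: or the successor:

That is, unless you are the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem.

Why calculate it? Every online dictionary has this information: in·vis·i·ble

Here is a solution using: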

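This is a particularly difficult problem, and the LaTeX hyphenation algorithm does not solve it completely. The paper by Marchand, Adsett, and Damper (2007) gives a good summary of some of the available methods and the challenges involved.

I was trying to tackle this problem for a program that calculates the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this web site: and it gets reasonably close. It still has trouble with complicated words such as "invisible" and "hyphenation", but I have found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that "es" can be either syllabic or not. It's a gamble, but I decided to remove the "es" in my algorithm.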

private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            // don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // if full cycle and no vowel found, set lastWasVowel to false
        if (!foundVowel)
            lastWasVowel = false;
    }
    // remove es, it's usually silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent e
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;
    return numVowels;
}

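Thanks, Joe Basirico, for sharing your quick and dirty implementation in C#. I have used the big libraries, and they work, but they are usually a bit slow, and for quick projects your method works fine.

Here is your code in Java, along with test cases: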

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's usually silent
    // (endsWith() compares contents; == on Strings compares references in Java)
    if (word.length() > 2 &&
            word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 &&
            word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

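The result was as expected (it works well enough for Flesch-Kincaid):

Thank you, Joe Basirico and tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other versions of Lua as well):

countsyllables.lua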

function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

And some fun tests to confirm that it works (as much as it is supposed to):

countsyllables.tests.lua

require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")

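I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here:

I use a combination of a dictionary and an algorithm to count syllables.

You can view my library here:

I just tested my algorithm and it had a 99.4% strike rate!

Lawrence lawrence = new Lawrence();
System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output: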

4
3
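The dictionary-plus-algorithm combination described above could look roughly like the Python sketch below. It is not the library's actual implementation: the `LEXICON` entries, the function names, and the fallback heuristic (count vowel groups, drop a final silent "e", never return less than one) are all assumptions made for this illustration.

```python
import re

# Hypothetical lexicon of known syllable counts; a real one would be much larger.
LEXICON = {"hyphenation": 4, "computer": 3}


def heuristic_count(word):
    """Fallback: count vowel groups, then drop a trailing silent 'e'."""
    lower = word.lower()
    count = len(re.findall(r"[aeiouy]+", lower))
    if lower.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)  # assume every word has at least one syllable


def count_syllables(word, lexicon=LEXICON):
    """Prefer the dictionary entry; use the heuristic for unknown words."""
    return lexicon.get(word.lower(), heuristic_count(word))


print(count_syllables("hyphenation"))  # 4, from the lexicon
print(count_syllables("same"))         # 1, from the heuristic
```

The lookup handles irregular words exactly, while the heuristic keeps the function usable for words the lexicon has never seen.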

trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']

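Bumping @Tihamer and @joe-basirico. A very useful function: not perfect, but good for most small-to-medium projects. Joe, I have rewritten an implementation of your code in Python: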

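def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel:
                    numVowels += 1  # don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  # if full cycle and no vowel found, set lastWasVowel to False
            lastWasVowel = False
    # the remaining checks follow the C# version above
    if len(word) > 2 and word[-2:] == "es":  # remove "es", it is usually silent
        numVowels -= 1
    elif len(word) > 1 and word[-1:] == "e":  # remove silent e
        numVowels -= 1
    return numVowels

    String hyphenedTerm = hyphenator.hyphenate(term);
    String[] hyphens = hyphenedTerm.split("\u00AD");
    int syllables = hyphens.length;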

public String[] syllables(String text){
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try{
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            if(link == null){return new String[]{text};}
            relHref = link.html(); 
        }catch(IOException e){
            relHref = text;
        }
        String[] syl = relHref.split("·");
        return syl;
    }

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4

countSyllablesInWord = function(words)
  {
  #word = "super";
  n.words = length(words);
  result = list();
  for(j in 1:n.words)
    {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    
    word.vec = strsplit(word,"")[[1]];
    word.vec;
    
    n.char = length(word.vec);
    
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    
    
    # nontrivial problem 
    if(n.vowels <= 1)
      {
      syllables = 1;
      str = word;
      } else {
              # syllables = 0;
              previous = "C";
              # on average ? 
              str = "";
              n.hyphen = 0;
        
              for(i in 1:n.char)
                {
                my.char = word.vec[i];
                my.vowel = is.vowel[i];
                if(my.vowel)
                  {
                  if(previous == "C")
                    {
                    if(i == 1)
                      {
                      str = paste0(my.char, "-");
                      n.hyphen = 1 + n.hyphen;
                      } else {
                              if(i < n.char)
                                {
                                if(n.vowels > (n.hyphen + 1))
                                  {
                                  str = paste0(str, my.char, "-");
                                  n.hyphen = 1 + n.hyphen;
                                  } else {
                                           str = paste0(str, my.char);
                                          }
                                } else {
                                        str = paste0(str, my.char);
                                        }
                              }
                     # syllables = 1 + syllables;
                     previous = "V";
                    } else {  # "VV"
                          # assume what  ?  vowel team?
                          str = paste0(str, my.char);
                          }
            
                } else {
                            str = paste0(str, my.char);
                            previous = "C";
                            }
                #
                }
        
              syllables = 1 + n.hyphen;
              }
  
      result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
      }
  
  if(n.words == 1) { result[[1]]; } else { result; }
  }

my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));

my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);

my.count.df;

#    syllables vowels         word
# 1          4      4   A-me-ri-ca
# 2          4      5 be-auti-fu-l
# 3          3      4   spa-ci-ous
# 4          2      2       ski-es
# 5          2      2       a-mber
# 6          2      2       wa-ves
# 7          2      2       gra-in
# 8          2      2      pu-rple
# 9          3      4  mo-unta-ins
# 10         3      3    ma-je-sty


################ hackathon #######


# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word



# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/


  # https://enchantedlearning.com/consonantblends/index.shtml
  # start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr", 
  #                   "fl", "fr", "gl", "gr", "pl", "pr",
  #                   "sc", "sh", "sk", "sl", "sm", "sn",
  #                   "sp", "st", "sw", "th", "tr", "tw",
  #                   "wh", "wr");
  # start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
  #                     "spr", "squ", "str", "thr");
  # 
  # 
  # 
  # end.digraphs = c("ch","sh","th","ng","dge","tch");
  # 
  # ile
  # 
  # farmer
  # ar er
  # 
  # vowel teams ... beaver1
  # 
  # 
  # # "able"
  # # http://www.abcfastphonics.com/letter-blends/blend-cial.html
  # blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian", 
  #             "ck", "ct", "dge", "dis", "ed", "ex", "ful", 
  #             "gh", "ng", "ous", "kn", "ment", "mis", );
  # 
  # glue = c("ld", "st", "nd", "ld", "ng", "nk", 
  #           "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch", 
  #           "nse", "nt", "ph", "psy", "pt", "re", )
  # 
  # 
  # start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
  # 
  # # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
  # digraphs.start = c("ch","sh","th","wh","ph","qu");
  # digraphs.end = c("ch","sh","th","ng","dge","tch");
  # # https://www.education.com/worksheet/article/beginning-consonant-blends/
  # blends.start = c("pl", "gr", "gl", "pr",
  #                 
  # blends.end = c("lk","nk","nt",
  # 
  # 
  # # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
  # # Monte     Mon-te
  # # Sophia    So-phi-a
  # # American  A-mer-i-can
  # 
  # n.vowels = 0;
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  # 
  # 
  # 
  # 
  # 
  # n.syll = 0;
  # str = "";
  # 
  # previous = "C"; # consonant vs "V" vowel
  # 
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  #   
  #   my.vowel = is.element(tolower(my.char), vowels);
  #   if(my.vowel)
  #     {
  #     n.vowels = 1 + n.vowels;
  #     if(previous == "C")
  #       {
  #       if(i == 1)
  #         {
  #         str = paste0(my.char, "-");
  #         } else {
  #                 if(n.syll > 1)
  #                   {
  #                   str = paste0(str, "-", my.char);
  #                   } else {
  #                          str = paste0(str, my.char);
  #                         }
  #                 }
  #        n.syll = 1 + n.syll;
  #        previous = "V";
  #       } 
  #     
  #   } else {
  #               str = paste0(str, my.char);
  #               previous = "C";
  #               }
  #   #
  #   }
  # 
  # 
  # 
  # 
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE   1
# IDEA   3
# IDEAS  2
# IDEE   2
# IDE   1
# AIDA   2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE  1
# HALIDE  2
# TELEPHONE 3
# TELEPHONY 4
# DUE   1
# IDEAL  2
# DEE   1
# UREA  3
# VACUO  3
# SEANCE  1
# SAILED  1
# RIBBED  1
# MOPED  1
# BLESSED  1
# AGED  1
# TOTED  2
# WARRED  1
# UNDERFED 2
# JADED  2
# INBRED  2
# BRED  1
# RED   1
# STATES  1
# TASTES  1
# TESTES  1
# UTILIZES  4

computeReadability = function(n.sentences, n.words, syllables=NULL)
  {
  n = length(syllables);
  n.syllables = 0;
  for(i in seq_len(n))  # seq_len() skips the loop when syllables is empty
    {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
    }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59; 
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948  * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  
  list("FRE" = FRE, "FKGL" = FKGL); 
  }
