NLP: Detecting syllables in a word
I need to find a fairly efficient way to detect the syllables in a word. For example: invisible -> in-vi-sib-le

There are some syllabification rules that could be used: V, CV, VC, CVC, CCV, CCCV, CVCC (where V is a vowel and C is a consonant). For example: pronunciation (5 syllables: pro-nun-ci-a-tion; CV-CVC-CV-V-CVC).

I have tried a few methods: regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.

I would appreciate any tips on alternative ways to solve this problem besides the approaches I have already tried.
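I work in Java, but any tips in C/C++, C#, Python, Perl... would be useful to me.

For the purposes of hyphenation, read about the TeX approach to this problem. In particular, see Frank Liang's thesis, "Word Hy-phen-a-tion by Com-put-er". His algorithm is very accurate, and it includes a small exception dictionary for the cases where the algorithm does not work.

Perl has a module for this. You could try it, or look into its algorithm. I also saw a few other, older modules there.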
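To make the TeX approach mentioned above more concrete, here is a minimal Python sketch of Liang-style pattern matching. The nine patterns are only the handful needed for the familiar "hyphenation" walkthrough and should be treated as illustrative; a real pattern file has thousands of entries plus an exception list. The function names and the `left_min`/`right_min` defaults are assumptions made for this sketch. Note also that hyphenation points are not always syllable boundaries.

```python
import re

# Illustrative subset of Liang-style patterns; NOT a complete pattern file.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split e.g. 'hy3ph' into letters 'hyph' and gap scores [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)  # score for the gap before letters[pos]
        else:
            pos += 1
    return letters, scores


def hyphenate(word, patterns=PATTERNS, left_min=2, right_min=3):
    """Return the pieces of `word` split at gaps whose highest score is odd."""
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    levels = [0] * (len(padded) + 1)   # levels[k] = gap before padded[k]
    for pattern in patterns:
        letters, scores = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, score in enumerate(scores):
                levels[start + offset] = max(levels[start + offset], score)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    # Keep at least left_min letters before and right_min letters after a break.
    for i in range(left_min, len(word) - right_min + 1):
        if levels[i + 1] % 2 == 1:     # +1 accounts for the leading dot
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


print(hyphenate("hyphenation"))  # ['hy', 'phen', 'ation']
```

Even-numbered scores suppress a break and odd-numbered scores allow one, with the highest score at each gap winning. That lets a compact pattern list encode both rules and exceptions to those rules.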
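I don't understand why a regular expression would give you only a count of syllables. You should be able to get the syllables themselves using capturing parentheses. Assuming you can construct a regular expression that works, that is.

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: or the successor: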
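That is, unless you are the type who enjoys reading a 60-page thesis rather than adapting freely available code for a non-unique problem.

Why calculate it? Every online dictionary has this information: in·vis·i·ble

Here is a solution using: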
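This is a particularly difficult problem, and the LaTeX hyphenation algorithm does not completely solve it. The paper by Marchand, Adsett, and Damper (2007) gives a good summary of some of the available methods and the challenges involved.

I was trying to tackle this problem for a program that calculates the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I have found it gets in the ballpark for my purposes.

It has the advantage of being easy to implement. I found that a trailing "es" can be either syllabic or not. It is a gamble, but I decided to remove the "es" in my algorithm.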
private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            // don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // if full cycle and no vowel found, set lastWasVowel to false
        if (!foundVowel)
            lastWasVowel = false;
    }
    // remove "es", it is usually silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent e
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;
    return numVowels;
}
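Thanks, Joe Basirico, for sharing your quick and dirty implementation in C#. I have used the big libraries, and they work, but they are usually a bit slow, and for quick projects your method works fine.
Here is your code in Java, along with test cases: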
// Wrapper class added only so the snippet compiles on its own.
public class SyllableCounter {

    public static int countSyllables(String word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        // Lower-case first so a leading capital vowel ("American") is counted.
        String lower = word.toLowerCase();
        int numVowels = 0;
        boolean lastWasVowel = false;
        for (char wc : lower.toCharArray()) {
            boolean foundVowel = false;
            for (char v : vowels)
            {
                // don't count diphthongs
                if ((v == wc) && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }
            // If full cycle and no vowel found, set lastWasVowel to false
            if (!foundVowel)
                lastWasVowel = false;
        }
        // Remove "es", it is usually silent.
        // endsWith() compares contents; == on Strings compares references.
        if (lower.length() > 2 && lower.endsWith("es"))
            numVowels--;
        // remove silent e
        else if (lower.length() > 1 && lower.endsWith("e"))
            numVowels--;
        return numVowels;
    }

    public static void main(String[] args) {
        String txt = "what";
        System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
        txt = "super";
        System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
        txt = "Maryland";
        System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
        txt = "American";
        System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
        txt = "disenfranchized";
        System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
        txt = "Sophia";
        System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    }
}
The results were as expected (it works well enough for Flesch-Kincaid).
Thank you, Joe Basirico and tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other versions of Lua as well):
countsyllables.lua
function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false
  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end
    if not foundVowel then
      lastWasVowel = false
    end
  end
  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end
  return numVowels
end
And some fun tests to confirm that it works (as much as it is supposed to):
countsyllables.tests.lua
require "countsyllables"
tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}
for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end
print("Tests passed.")
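I could not find an adequate way to count syllables, so I designed a method myself. You can view my method here:
I use a combination of a dictionary and an algorithm to count syllables. You can view my library here:
I just tested my algorithm and got a 99.4% hit rate.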
Lawrence lawrence = new Lawrence();
System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));
Output:
4
3
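The dictionary-plus-algorithm combination described above can be sketched roughly as follows. This is not the library's actual code: the `EXCEPTIONS` table, its three entries, and both function names are hypothetical, and the fallback is simply the vowel-group heuristic used elsewhere in this thread, with one small addition so that a word never gets a count of zero.

```python
import re

# Hypothetical exception dictionary; a real one would be far larger.
EXCEPTIONS = {"invisible": 4, "hyphenation": 4, "idea": 3}


def heuristic_count(word):
    """Count runs of vowels, then discount a trailing 'es' or 'e'."""
    count = len(re.findall(r"[aeiouy]+", word))
    if len(word) > 2 and word.endswith("es"):
        count -= 1
    elif len(word) > 1 and word.endswith("e"):
        count -= 1
    return max(count, 1)  # assumption: every word has at least one syllable


def count_syllables(word):
    """Prefer the dictionary entry; fall back to the heuristic otherwise."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return heuristic_count(word)


print(count_syllables("computer"))   # 3 (heuristic)
print(count_syllables("invisible"))  # 4 (dictionary; the heuristic alone gives 3)
```

The design choice is the same one Liang's hyphenation makes: keep the rules cheap and general, and store only the words the rules get wrong.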
trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']
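Bumping @Tihamer and @joe-basirico. A very useful function: not perfect, but good enough for most small-to-medium projects. Joe, I have rewritten your implementation in Python: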
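def count_syllables(word):
    vowels = "aeiouy"
    num_vowels = 0
    last_was_vowel = False
    for wc in word:
        found_vowel = False
        for v in vowels:
            if v == wc:
                if not last_was_vowel:
                    num_vowels += 1  # don't count diphthongs
                found_vowel = last_was_vowel = True
                break
        if not found_vowel:  # full cycle and no vowel found: reset the flag
            last_was_vowel = False
    # The source is cut off here; the rest follows the C# version above.
    if len(word) > 2 and word[-2:] == "es":  # remove "es", it is usually silent
        num_vowels -= 1
    elif len(word) > 1 and word[-1:] == "e":  # remove silent e
        num_vowels -= 1
    return num_vowels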
String hyphenedTerm = hyphenator.hyphenate(term);
String hyphens[] = hyphenedTerm.split("\u00AD");
int syllables = hyphens.length;
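import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Wrapper class added only so the method compiles on its own; the name is arbitrary.
public class SyllableLookup {

    // Looks the word up on Merriam-Webster and splits its "word-syllables" entry.
    public String[] syllables(String text) {
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try {
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            // No syllable markup on the page: fall back to the whole word.
            if (link == null) { return new String[]{text}; }
            relHref = link.html();
        } catch (IOException e) {
            // Network or HTTP failure: fall back to the whole word.
            relHref = text;
        }
        return relHref.split("·");
    }
}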
from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops') # --> 4
countSyllablesInWord = function(words)
{
  #word = "super";
  n.words = length(words);
  result = list();
  for(j in 1:n.words)
  {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    word.vec = strsplit(word,"")[[1]];
    word.vec;
    n.char = length(word.vec);
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    # nontrivial problem
    if(n.vowels <= 1)
    {
      syllables = 1;
      str = word;
    } else {
      # syllables = 0;
      previous = "C";
      # on average ?
      str = "";
      n.hyphen = 0;
      for(i in 1:n.char)
      {
        my.char = word.vec[i];
        my.vowel = is.vowel[i];
        if(my.vowel)
        {
          if(previous == "C")
          {
            if(i == 1)
            {
              str = paste0(my.char, "-");
              n.hyphen = 1 + n.hyphen;
            } else {
              if(i < n.char)
              {
                if(n.vowels > (n.hyphen + 1))
                {
                  str = paste0(str, my.char, "-");
                  n.hyphen = 1 + n.hyphen;
                } else {
                  str = paste0(str, my.char);
                }
              } else {
                str = paste0(str, my.char);
              }
            }
            # syllables = 1 + syllables;
            previous = "V";
          } else { # "VV"
            # assume what ? vowel team?
            str = paste0(str, my.char);
          }
        } else {
          str = paste0(str, my.char);
          previous = "C";
        }
        #
      }
      syllables = 1 + n.hyphen;
    }
    result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
  }
  if(n.words == 1) { result[[1]]; } else { result; }
}
my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));
my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);
my.count.df;
#    syllables vowels         word
# 1          4      4   A-me-ri-ca
# 2          4      5 be-auti-fu-l
# 3          3      4   spa-ci-ous
# 4          2      2       ski-es
# 5          2      2       a-mber
# 6          2      2       wa-ves
# 7          2      2       gra-in
# 8          2      2      pu-rple
# 9          3      4  mo-unta-ins
# 10         3      3    ma-je-sty
################ hackathon #######
# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word
# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/
# https://enchantedlearning.com/consonantblends/index.shtml
# start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr",
# "fl", "fr", "gl", "gr", "pl", "pr",
# "sc", "sh", "sk", "sl", "sm", "sn",
# "sp", "st", "sw", "th", "tr", "tw",
# "wh", "wr");
# start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
# "spr", "squ", "str", "thr");
#
#
#
# end.digraphs = c("ch","sh","th","ng","dge","tch");
#
# ile
#
# farmer
# ar er
#
# vowel teams ... beaver1
#
#
# # "able"
# # http://www.abcfastphonics.com/letter-blends/blend-cial.html
# blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian",
# "ck", "ct", "dge", "dis", "ed", "ex", "ful",
# "gh", "ng", "ous", "kn", "ment", "mis", );
#
# glue = c("ld", "st", "nd", "ld", "ng", "nk",
# "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch",
# "nse", "nt", "ph", "psy", "pt", "re", )
#
#
# start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
#
# # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
# digraphs.start = c("ch","sh","th","wh","ph","qu");
# digraphs.end = c("ch","sh","th","ng","dge","tch");
# # https://www.education.com/worksheet/article/beginning-consonant-blends/
# blends.start = c("pl", "gr", "gl", "pr",
#
# blends.end = c("lk","nk","nt",
#
#
# # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
# # Monte Mon-te
# # Sophia So-phi-a
# # American A-mer-i-can
#
# n.vowels = 0;
# for(i in 1:n.char)
# {
# my.char = word.vec[i];
#
#
#
#
#
# n.syll = 0;
# str = "";
#
# previous = "C"; # consonant vs "V" vowel
#
# for(i in 1:n.char)
# {
# my.char = word.vec[i];
#
# my.vowel = is.element(tolower(my.char), vowels);
# if(my.vowel)
# {
# n.vowels = 1 + n.vowels;
# if(previous == "C")
# {
# if(i == 1)
# {
# str = paste0(my.char, "-");
# } else {
# if(n.syll > 1)
# {
# str = paste0(str, "-", my.char);
# } else {
# str = paste0(str, my.char);
# }
# }
# n.syll = 1 + n.syll;
# previous = "V";
# }
#
# } else {
# str = paste0(str, my.char);
# previous = "C";
# }
# #
# }
#
#
#
#
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE 1
# IDEA 3
# IDEAS 2
# IDEE 2
# IDE 1
# AIDA 2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE 1
# HALIDE 2
# TELEPHONE 3
# TELEPHONY 4
# DUE 1
# IDEAL 2
# DEE 1
# UREA 3
# VACUO 3
# SEANCE 1
# SAILED 1
# RIBBED 1
# MOPED 1
# BLESSED 1
# AGED 1
# TOTED 2
# WARRED 1
# UNDERFED 2
# JADED 2
# INBRED 2
# BRED 1
# RED 1
# STATES 1
# TASTES 1
# TESTES 1
# UTILIZES 4
computeReadability = function(n.sentences, n.words, syllables=NULL)
{
  # syllables is the list returned by countSyllablesInWord (one entry per word)
  n = length(syllables);
  n.syllables = 0;
  for(i in 1:n)
  {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
  }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59;
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948 * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  list("FRE" = FRE, "FKGL" = FKGL);
}
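As a quick worked example of the two readability formulas above, here is the same arithmetic as a small Python sketch. The function name and the sample counts (2 sentences, 20 words, 30 syllables) are made up for illustration and do not come from any real text.

```python
def flesch_scores(n_sentences, n_words, n_syllables):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


# Made-up counts: 10 words per sentence, 1.5 syllables per word.
fre, fkgl = flesch_scores(n_sentences=2, n_words=20, n_syllables=30)
print(round(fre, 3), round(fkgl, 2))  # 69.785 6.01
```

Both scores depend on the syllable count only through the average number of syllables per word, so a counter that is off by one on the occasional word shifts the result only slightly. That is why the rough heuristics in this thread are usually good enough for this purpose.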