Getting the Unicode characters of a language in Java


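Is there any way in Java to get all the Unicode characters of a particular language (for example Bengali or Arabic)?

The java.lang.Character class has a nested static class called UnicodeBlock. For example, you can get the Arabic Unicode block like this:

Character.UnicodeBlock block = Character.UnicodeBlock.ARABIC;
By iterating over all characters (or, more precisely, all Unicode code points), you can check each one to find its Unicode block: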

import java.util.HashSet;
import java.util.Set;

public class UnicodeBlockChars {
    public static void main(String[] args) {
        Set<Integer> arabicChars = findCodePointsInUnicodeBlock(Character.UnicodeBlock.ARABIC);
        Set<Integer> bengaliChars = findCodePointsInUnicodeBlock(Character.UnicodeBlock.BENGALI);
    }

    // Code points are kept as ints: casting to char would silently truncate
    // anything above U+FFFF (supplementary code points).
    private static Set<Integer> findCodePointsInUnicodeBlock(final Character.UnicodeBlock block) {
        final Set<Integer> codePoints = new HashSet<Integer>();
        for (int codePoint = Character.MIN_CODE_POINT; codePoint <= Character.MAX_CODE_POINT; codePoint++) {
            // UnicodeBlock.of returns null for code points that belong to no block
            if (block == Character.UnicodeBlock.of(codePoint)) {
                codePoints.add(codePoint);
            }
        }
        return codePoints;
    }
}

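Java did not support Unicode scripts until 1.7. Even so, Java's support for Unicode properties is very sketchy. It is basically stuck at the previous millennium's version of Unicode, which is a real problem. They claim they will catch up to Unicode 6 with JDK 7, but I have yet to see any evidence that they will get proper property support.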
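For readers on Java 7 or later, `Character.UnicodeScript` makes the Script property available. The following is a minimal sketch, assuming Java 8+ (it uses streams), of enumerating code points by script instead of by block. The class and method names are made up for illustration, and the counts it prints depend on the Unicode version bundled with your JDK, so they need not match the figures quoted in this answer.

```java
import java.util.ArrayList;
import java.util.List;

class ScriptCodePoints {
    // Collects every code point whose Unicode Script property equals the given script.
    static List<Integer> codePointsInScript(Character.UnicodeScript script) {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.UnicodeScript.of(cp) == script) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> arabic = codePointsInScript(Character.UnicodeScript.ARABIC);
        // Count how many of them lie in the Basic Multilingual Plane
        long bmp = arabic.stream().filter(Character::isBmpCodePoint).count();
        System.out.println("Arabic script: " + arabic.size()
                + " code points, " + bmp + " in the BMP");
    }
}
```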

In Unicode 6.0, there are 1051 code points that count as Arabic, 1020 of which are in the Basic Multilingual Plane:

% unichars --bmp  '\p{Script=Arabic}' | wc -l
    1020

% unichars -a '\p{Script=Arabic}' | wc -l
    1051
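This works because the unichars program is written in Perl, which has always had good Unicode property support. I ran it against Unicode 6.0; earlier Unicode versions give somewhat lower counts. In fact, Unicode 6.0 added 17 new Arabic characters: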

 % unichars -a '\p{Script=Arabic}' '\p{Age:6.0}' | wc -l
         17
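You cannot use blocks to do this. Scripts are not the same as blocks. Not all code points in a given block have the same script. Just as importantly, you will often find characters of a given script scattered across odd blocks.

For example, there are 18 non-Greek characters in the Greek block: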

% unichars '\p{InGreek}' '\P{IsGreek}' | wc -l
     18
And 13 non-Arabic characters in the Arabic block:

% unichars '\p{InArabic}' '\P{IsArabic}' | wc -l
     13
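The same kind of check can be sketched in Java 7 or later by comparing `Character.UnicodeBlock.of` with `Character.UnicodeScript.of` for each code point. This is only an illustration with hypothetical class and method names. It skips unassigned code points, and its results depend on the Unicode version your JDK ships, so the count may differ from the one above.

```java
import java.util.ArrayList;
import java.util.List;

class BlockScriptMismatch {
    // Assigned code points that sit in `block` but whose Script property is not `script`.
    static List<Integer> inBlockButNotScript(Character.UnicodeBlock block,
                                             Character.UnicodeScript script) {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.UNASSIGNED) {
                continue; // ignore holes in the block
            }
            if (block.equals(Character.UnicodeBlock.of(cp))
                    && Character.UnicodeScript.of(cp) != script) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> odd = inBlockButNotScript(
                Character.UnicodeBlock.ARABIC, Character.UnicodeScript.ARABIC);
        for (int cp : odd) {
            System.out.printf("U+%04X %s (%s)%n",
                    cp, Character.getName(cp), Character.UnicodeScript.of(cp));
        }
    }
}
```

For instance, U+060C ARABIC COMMA lives in the Arabic block but carries the Common script, so it shows up in this list.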
On top of that, there are 4 Greek blocks and 4 (or 5) Arabic blocks:

% uniprops -l | grep 'Block:.*Greek'
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block:Greek_And_Coptic
Block:Greek_Extended

% uniprops -l | grep 'Block:.*Arab'
Block:Arabic
Block:Arabic_Presentation_Forms_A
Block:Arabic_Presentation_Forms_B
Block:Arabic_Supplement 
Block:Old_South_Arabian
\p{Block:Greek} and \p{Greek_And_Coptic} are aliases, but the rest are all distinct.

But even if you look at all of those blocks, you will still miss some. For example:

% unichars '\p{IsGreek}' '[^\p{InAncient_Greek_Musical_Notation}\p{InAncient_Greek_Numbers}\p{InGreek}\p{InGreek_Extended}]' 
 ᴦ  7462 1D26 GREEK LETTER SMALL CAPITAL GAMMA
 ᴧ  7463 1D27 GREEK LETTER SMALL CAPITAL LAMDA
 ᴨ  7464 1D28 GREEK LETTER SMALL CAPITAL PI
 ᴩ  7465 1D29 GREEK LETTER SMALL CAPITAL RHO
 ᴪ  7466 1D2A GREEK LETTER SMALL CAPITAL PSI
 ᵝ  7517 1D5D MODIFIER LETTER SMALL BETA
 ᵞ  7518 1D5E MODIFIER LETTER SMALL GREEK GAMMA
 ᵟ  7519 1D5F MODIFIER LETTER SMALL DELTA
 ᵠ  7520 1D60 MODIFIER LETTER SMALL GREEK PHI
 ᵡ  7521 1D61 MODIFIER LETTER SMALL CHI
 ᵦ  7526 1D66 GREEK SUBSCRIPT SMALL LETTER BETA
 ᵧ  7527 1D67 GREEK SUBSCRIPT SMALL LETTER GAMMA
 ᵨ  7528 1D68 GREEK SUBSCRIPT SMALL LETTER RHO
 ᵩ  7529 1D69 GREEK SUBSCRIPT SMALL LETTER PHI
 ᵪ  7530 1D6A GREEK SUBSCRIPT SMALL LETTER CHI
 ᶿ  7615 1DBF MODIFIER LETTER SMALL THETA
 Ω  8486 2126 OHM SIGN
See the problem?
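A rough Java 7+ counterpart of that last query, for anyone who wants to reproduce it without Perl: list every code point whose script is Greek but whose block is not one of the Greek-named blocks. The class name is hypothetical, and the exact list depends on the JDK's Unicode version (newer versions may report more entries than shown above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class GreekOutsideGreekBlocks {
    // The blocks whose names mention Greek (GREEK is Java's name for Greek and Coptic).
    static final Set<Character.UnicodeBlock> GREEK_BLOCKS = new HashSet<>(Arrays.asList(
            Character.UnicodeBlock.GREEK,
            Character.UnicodeBlock.GREEK_EXTENDED,
            Character.UnicodeBlock.ANCIENT_GREEK_NUMBERS,
            Character.UnicodeBlock.ANCIENT_GREEK_MUSICAL_NOTATION));

    // Code points with Script=Greek that live outside all of the blocks above.
    static List<Integer> greekScriptOutsideGreekBlocks() {
        List<Integer> result = new ArrayList<>();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.GREEK
                    && !GREEK_BLOCKS.contains(Character.UnicodeBlock.of(cp))) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int cp : greekScriptOutsideGreekBlocks()) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```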

By the way, uniprops is good for more than just listing all possible properties. It can also give you the properties of any given code point:

% uniprops -a 1dbf 9e6 NEL Greek:Omicron
U+1DBF <ᶿ> \N{ MODIFIER LETTER SMALL THETA }:
    \w \pL \p{L_} \p{Lm}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InPhoneticExtensionsSupplement Case_Ignorable CI Cased Changes_When_NFKC_Casefolded CWKCF L Lm Gr_Base Grapheme_Base Graph GrBase Grek ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
    Age:4.1 Bidi_Class:L Bidi_Class=Left_To_Right Bidi_Class:Left_To_Right Bc=L Block:Phonetic_Extensions_Supplement Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Decomposition_Type:Non_Canon Decomposition_Type=Non_Canonical
       Decomposition_Type:Non_Canonical Dt=NonCanon Decomposition_Type:Sup Decomposition_Type=Super Decomposition_Type:Super Dt=Sup East_Asian_Width=Neutral East_Asian_Width:Neutral Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Script=Greek Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:AL Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:4.1
       In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Greek Sc=Grek Script:Grek Sentence_Break:LO Sentence_Break=Lower Sentence_Break:Lower SB=LO Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter
U+09E6 <০> \N{ BENGALI DIGIT ZERO }:
    \w \d \pN \p{Nd}
    All Any Alnum Assigned Beng Bengali InBengali Is_Bengali Decimal_Number Digit Nd N Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC Number Print Word XID_Continue XIDC
    Age:1.1 Script=Bengali Block=Bengali Bidi_Class:L Bidi_Class=Left_To_Right Bidi_Class:Left_To_Right Bc=L Block:Bengali Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Decomposition_Type:None Dt=None East_Asian_Width=Neutral
       East_Asian_Width:Neutral Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U
       Joining_Type=Non_Joining Line_Break:NU Line_Break=Numeric Line_Break:Numeric Lb=NU Numeric_Type:De Numeric_Type=Decimal Numeric_Type:Decimal Nt=De Numeric_Value:0 Nv=0 Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0
       Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Beng Script:Bengali Sc=Beng Sentence_Break:NU Sentence_Break=Numeric Sentence_Break:Numeric SB=NU Word_Break:NU Word_Break=Numeric Word_Break:Numeric WB=NU
U+0085 <U+0085> \N{ NEXT LINE (NEL) }:
    \s \v \R \pC \p{Cc}
    All Any Assigned InLatin1 C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS Space SpacePerl VertSpace White_Space WSpace
    Age:1.1 Bidi_Class:B Bidi_Class=Paragraph_Separator Bidi_Class:Paragraph_Separator Bc=B Block:Latin_1 Block=Latin_1_Supplement Block:Latin_1_Supplement Blk=Latin1 Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Script=Common
       Decomposition_Type:None Dt=None East_Asian_Width=Neutral East_Asian_Width:Neutral Grapheme_Cluster_Break:CN Grapheme_Cluster_Break=Control Grapheme_Cluster_Break:Control GCB=CN Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup
       Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:Next_Line Lb=NL Line_Break:NL Line_Break=Next_Line Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:SE Sentence_Break=Sep Sentence_Break:Sep SB=SE Word_Break:Newline WB=NL Word_Break:NL Word_Break=Newline
U+039F <Ο> \N{ GREEK CAPITAL LETTER OMICRON }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper
       Uppercase Word XID_Continue XIDC XID_Start XIDS
    Age:1.1 Bidi_Class:L Bidi_Class=Left_To_Right Bidi_Class:Left_To_Right Bc=L Block:Greek Block=Greek_And_Coptic Block:Greek_And_Coptic Blk=Greek Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
       East_Asian_Width:A East_Asian_Width=Ambiguous East_Asian_Width:Ambiguous Ea=A Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Script=Greek Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group Jg=NoJoiningGroup
       Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:AL Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Greek Sc=Grek Script:Grek Sentence_Break:UP Sentence_Break=Upper Sentence_Break:Upper SB=UP Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter
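Java exposes only a small fraction of these properties, but a few can be read directly from `java.lang.Character`. The sketch below (Java 7+, with an illustrative class name that is not part of any library) prints the name, block, script and general-category constant of a code point. It is nowhere near as complete as the uniprops output.

```java
class CodePointInfo {
    // Formats the handful of Unicode properties that java.lang.Character exposes.
    static String describe(int cp) {
        return String.format("U+%04X %s: block=%s script=%s type=%d",
                cp,
                Character.getName(cp),             // null if the code point is unassigned
                Character.UnicodeBlock.of(cp),     // null if it belongs to no block
                Character.UnicodeScript.of(cp),
                Character.getType(cp));            // e.g. Character.DECIMAL_DIGIT_NUMBER
    }

    public static void main(String[] args) {
        int[] samples = {0x1DBF, 0x09E6, 0x0085, 0x039F};
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```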