How to remove Arabic punctuation from a string in Java

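I'm working on an Arabic dictionary, and I get sentences like

String original = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";

from my database, but I can't process the sentences without first removing the accents and punctuation.

I tried using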

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public static String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
} 

But it does not work.

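Why not match on the Unicode categories Punctuation (P) and Mark, Nonspacing (Mn)?

I'm not sure what result you expect, since you didn't post it (and I can't read Arabic :) ), but try the following code: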

String input = "'أَبَنَ فُلانًا: عَابَه ورَمَاه بخَلَّة سَوء.'";
// \p{P} matches any punctuation; \p{Mn} matches nonspacing marks (the Arabic diacritics)
Pattern p = Pattern.compile("[\\p{P}\\p{Mn}]");
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("found: " + m.group());
}
m.reset();
System.out.println("Replaced: " + m.replaceAll(" "));
Output:

found: '
found: َ
found: َ
found: َ
found: ُ
found: ً
found: :
found: َ
found: َ
found: َ
found: َ
found: َ
found: ّ
found: َ
found: َ
found: .
found: '
Replaced:  أ ب ن  ف لان ا  ع اب ه ور م اه بخ ل  ة س وء  
I guess this isn't the final result you want, but I hope it's something you can work with.

Also, there is a goldmine of information on Unicode categories out there. I believe most of it applies to Java Patterns.
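
As background on why the deAccent method from the question has no effect: \p{InCombiningDiacriticalMarks} only covers the Combining Diacritical Marks block (U+0300 to U+036F), while the Arabic vowel marks live in the Arabic block (U+0600 to U+06FF). Below is a minimal sketch of the category-based approach that deletes the matches instead of replacing them with spaces. The class and method names are made up for illustration, and the sample text is written with \u escapes so the file compiles under any source encoding.

```java
import java.util.regex.Pattern;

final class ArabicPunctuationStripper {

    // \p{P}  = any punctuation (ASCII quotes and colons, Arabic comma U+060C, ...)
    // \p{Mn} = nonspacing marks, which includes the Arabic vowel marks
    private static final Pattern PUNCT_OR_MARK = Pattern.compile("[\\p{P}\\p{Mn}]+");

    private ArabicPunctuationStripper() { }

    static String strip(String text) {
        return PUNCT_OR_MARK.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // FATHA (U+064E) is in the Arabic block, so the question's pattern never sees it.
        System.out.println(Character.UnicodeBlock.of('\u064E'));

        // quote, alef-hamza, fatha, beh, fatha, noon, fatha, Arabic comma, colon
        String sample = "'\u0623\u064E\u0628\u064E\u0646\u064E\u060C:";
        System.out.println(strip(sample)); // the three base letters only
    }
}
```

Whether punctuation should be deleted or turned into a space depends on how the dictionary tokenizes its entries; replacing with "" keeps each word intact, which the space-based output above does not.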

Try this, it works well in my project:

/**
 * ArabicNormalizer class
 * @author Ibrabel <ibrabel@gmail.com>
 */
public final class ArabicNormalizer {

    private String input;
    private final String output;

    /**
     * ArabicNormalizer constructor
     * @param input String
     */
    public ArabicNormalizer(String input){
        this.input=input;
        this.output=normalize();
    }

    /**
     * normalize Method
     * @return String
     */
    private String normalize(){

        //Remove honorific sign
        input=input.replaceAll("\u0610", "");//ARABIC SIGN SALLALLAHOU ALAYHE WA SALLAM
        input=input.replaceAll("\u0611", "");//ARABIC SIGN ALAYHE ASSALLAM
        input=input.replaceAll("\u0612", "");//ARABIC SIGN RAHMATULLAH ALAYHE
        input=input.replaceAll("\u0613", "");//ARABIC SIGN RADI ALLAHOU ANHU
        input=input.replaceAll("\u0614", "");//ARABIC SIGN TAKHALLUS

        //Remove Koranic annotation
        input=input.replaceAll("\u0615", "");//ARABIC SMALL HIGH TAH
        input=input.replaceAll("\u0616", "");//ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
        input=input.replaceAll("\u0617", "");//ARABIC SMALL HIGH ZAIN
        input=input.replaceAll("\u0618", "");//ARABIC SMALL FATHA
        input=input.replaceAll("\u0619", "");//ARABIC SMALL DAMMA
        input=input.replaceAll("\u061A", "");//ARABIC SMALL KASRA
        input=input.replaceAll("\u06D6", "");//ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        input=input.replaceAll("\u06D7", "");//ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
        input=input.replaceAll("\u06D8", "");//ARABIC SMALL HIGH MEEM INITIAL FORM
        input=input.replaceAll("\u06D9", "");//ARABIC SMALL HIGH LAM ALEF
        input=input.replaceAll("\u06DA", "");//ARABIC SMALL HIGH JEEM
        input=input.replaceAll("\u06DB", "");//ARABIC SMALL HIGH THREE DOTS
        input=input.replaceAll("\u06DC", "");//ARABIC SMALL HIGH SEEN
        input=input.replaceAll("\u06DD", "");//ARABIC END OF AYAH
        input=input.replaceAll("\u06DE", "");//ARABIC START OF RUB EL HIZB
        input=input.replaceAll("\u06DF", "");//ARABIC SMALL HIGH ROUNDED ZERO
        input=input.replaceAll("\u06E0", "");//ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        input=input.replaceAll("\u06E1", "");//ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
        input=input.replaceAll("\u06E2", "");//ARABIC SMALL HIGH MEEM ISOLATED FORM
        input=input.replaceAll("\u06E3", "");//ARABIC SMALL LOW SEEN
        input=input.replaceAll("\u06E4", "");//ARABIC SMALL HIGH MADDA
        input=input.replaceAll("\u06E5", "");//ARABIC SMALL WAW
        input=input.replaceAll("\u06E6", "");//ARABIC SMALL YEH
        input=input.replaceAll("\u06E7", "");//ARABIC SMALL HIGH YEH
        input=input.replaceAll("\u06E8", "");//ARABIC SMALL HIGH NOON
        input=input.replaceAll("\u06E9", "");//ARABIC PLACE OF SAJDAH
        input=input.replaceAll("\u06EA", "");//ARABIC EMPTY CENTRE LOW STOP
        input=input.replaceAll("\u06EB", "");//ARABIC EMPTY CENTRE HIGH STOP
        input=input.replaceAll("\u06EC", "");//ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        input=input.replaceAll("\u06ED", "");//ARABIC SMALL LOW MEEM

        //Remove tatweel
        input=input.replaceAll("\u0640", "");

        //Remove tashkeel
        input=input.replaceAll("\u064B", "");//ARABIC FATHATAN
        input=input.replaceAll("\u064C", "");//ARABIC DAMMATAN
        input=input.replaceAll("\u064D", "");//ARABIC KASRATAN
        input=input.replaceAll("\u064E", "");//ARABIC FATHA
        input=input.replaceAll("\u064F", "");//ARABIC DAMMA
        input=input.replaceAll("\u0650", "");//ARABIC KASRA
        input=input.replaceAll("\u0651", "");//ARABIC SHADDA
        input=input.replaceAll("\u0652", "");//ARABIC SUKUN
        input=input.replaceAll("\u0653", "");//ARABIC MADDAH ABOVE
        input=input.replaceAll("\u0654", "");//ARABIC HAMZA ABOVE
        input=input.replaceAll("\u0655", "");//ARABIC HAMZA BELOW
        input=input.replaceAll("\u0656", "");//ARABIC SUBSCRIPT ALEF
        input=input.replaceAll("\u0657", "");//ARABIC INVERTED DAMMA
        input=input.replaceAll("\u0658", "");//ARABIC MARK NOON GHUNNA
        input=input.replaceAll("\u0659", "");//ARABIC ZWARAKAY
        input=input.replaceAll("\u065A", "");//ARABIC VOWEL SIGN SMALL V ABOVE
        input=input.replaceAll("\u065B", "");//ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
        input=input.replaceAll("\u065C", "");//ARABIC VOWEL SIGN DOT BELOW
        input=input.replaceAll("\u065D", "");//ARABIC REVERSED DAMMA
        input=input.replaceAll("\u065E", "");//ARABIC FATHA WITH TWO DOTS
        input=input.replaceAll("\u065F", "");//ARABIC WAVY HAMZA BELOW
        input=input.replaceAll("\u0670", "");//ARABIC LETTER SUPERSCRIPT ALEF

        return input;
    }

    /**
     * @return the output
     */
    public String getOutput() {
        return output;
    }

    public static void main(String[] args) {
        String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
        System.out.println("Before: "+test);
        test=new ArabicNormalizer(test).getOutput();
        System.out.println("After: "+test);
    }
}
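
The code points removed above form a handful of contiguous ranges, so the same effect can probably be had with one precompiled character class instead of roughly sixty replaceAll calls. This is only a sketch under that assumption; ArabicMarkStripper and strip are hypothetical names, not part of the original answer.

```java
import java.util.regex.Pattern;

final class ArabicMarkStripper {

    // Same code points as the replaceAll chain above, folded into ranges:
    //   U+0610-U+061A  honorific signs and small high marks
    //   U+0640         tatweel
    //   U+064B-U+065F  tashkeel
    //   U+0670         superscript alef
    //   U+06D6-U+06ED  Koranic annotation signs
    private static final Pattern MARKS = Pattern.compile(
            "[\\u0610-\\u061A\\u0640\\u064B-\\u065F\\u0670\\u06D6-\\u06ED]");

    private ArabicMarkStripper() { }

    static String strip(String input) {
        return MARKS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // kaf, fatha, lam, shadda, fatha, alef, space, place-of-sajdah sign
        String test = "\u0643\u064E\u0644\u0651\u064E\u0627 \u06E9";
        System.out.println("Before: " + test);
        System.out.println("After: " + strip(test));
    }
}
```

Compiling the pattern once also avoids re-parsing a regex on every call, which may matter when normalizing a whole dictionary.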
import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 *
 * @author Ibbtek <http://ibbtek.altervista.org/>
 */
public class ArabicDiacritics {

    private String input;
    private final String output;

    /**
     * ArabicDiacritics constructor
     * @param input String
     */
    public ArabicDiacritics(String input){
        this.input=input;
        this.output=normalize();
    }

    /**
     * normalize Method
     * @return String
     */
    private String normalize(){

        input = Normalizer.normalize(input, Form.NFKD)
                .replaceAll("\\p{M}", "");

        return input;
    }

    /**
     * @return the output
     */
    public String getOutput() {
        return output;
    }

    public static void main(String[] args) {
        String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
        System.out.println("Before: "+test);
        test=new ArabicDiacritics(test).getOutput();
        System.out.println("After: "+test);
    }
}
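
One caveat with the NFKD approach is worth checking against your data. Normalization decomposes letters such as ALEF WITH MADDA ABOVE (U+0622) into a base letter plus a combining mark, so stripping \p{M} afterwards changes the letter itself. Symbols like the sajdah sign (U+06E9) are not marks and are left in place. The sketch below compares it with stripping \p{Mn} without normalizing first; the class and method names are illustrative only.

```java
import java.text.Normalizer;

final class NfkdCaveatDemo {

    private NfkdCaveatDemo() { }

    // The approach from the class above: decompose, then drop every mark.
    static String stripWithNfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
    }

    // Alternative: drop nonspacing marks only, leaving precomposed letters alone.
    static String stripMarksOnly(String s) {
        return s.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        String alefMadda = "\u0622"; // ALEF WITH MADDA ABOVE
        // NFKD splits it into U+0627 + U+0653, and the mark is then removed.
        System.out.println(stripWithNfkd(alefMadda).equals("\u0627"));  // true
        // Without normalization the precomposed letter survives.
        System.out.println(stripMarksOnly(alefMadda).equals("\u0622")); // true
        // PLACE OF SAJDAH is a symbol, not a mark, so neither variant removes it.
        System.out.println(stripWithNfkd("\u06E9").equals("\u06E9"));   // true
    }
}
```

If the hamza and madda carriers need to be preserved for dictionary lookups, skipping normalization may be the safer choice; symbols such as the sajdah sign would then have to be listed explicitly.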
String withDiacritics = "طَائِفِيّةٌ";
String withoutDiacritics = withDiacritics.replaceAll("(ّ)?(َ)?(ً)?(ُ)?(ٌ)?(ِ)?(ٍ)?(~)?(ْ)?", "");
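
// Android variant: needs import java.text.Normalizer; Log is presumably android.util.Log.
// textWithDiacritics is the input string to clean.
String diacless = Normalizer.normalize(textWithDiacritics, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
Log.d("diac_remove", "replaced: " + diacless);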