Java: Remove special characters from text/PDF using Apache Tika

Tags: java, character-encoding, special-characters, apache-tika, text-decoding

I am parsing PDF files to extract their text using Apache Tika.

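import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

// Content handler that collects the body text
BodyContentHandler handler = new BodyContentHandler();

// Metadata
Metadata metadata = new Metadata();

// Parser context, passed to the parser along with the InputStream
ParseContext pcontext = new ParseContext();

// Open the input file; try-with-resources closes the stream afterwards
try (InputStream inputstream = new FileInputStream(new File(faInputFileName)))
{
    // Parse the document using the PDF parser from Tika
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(inputstream, handler, metadata, pcontext);
}
catch (Exception e)
{
    // Report what went wrong instead of swallowing the exception
    System.out.println("Exception caught: " + e);
}
String extractedText = handler.toString();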
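The above code works, and the text is extracted from the PDF.

The PDF files contain some special characters (such as @, &, £, or the trademark sign). How can I remove these special characters, either during or after extraction?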

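PDF uses Unicode code points, so you may well have strings that contain surrogate pairs, combining forms (e.g. for diacritics) and so on, and you may wish to reduce these to their closest ASCII equivalent, e.g. normalise é to e. If so, you can start by normalising the text like this: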

import java.text.Normalizer;

String normalisedText = Normalizer.normalize(handler.toString(), Normalizer.Form.NFD);
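To see what the normalisation step does on its own, here is a minimal sketch. The class name `NfdDemo` and the helper `decompose` are made up for illustration; only `java.text.Normalizer` comes from the answer. It shows that NFD splits a precomposed é into a plain e followed by a combining acute accent (U+0301), rather than removing the accent outright.

```java
import java.text.Normalizer;

public class NfdDemo {
    // Hypothetical helper: returns the NFD (canonical decomposition) form of the input.
    public static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";              // "café" with a precomposed é (U+00E9)
        String decomposed = decompose(composed);

        System.out.println(composed.length());      // 4
        System.out.println(decomposed.length());    // 5: 'e' and U+0301 are now separate chars
        System.out.println(decomposed.charAt(3));   // e
        System.out.println(Integer.toHexString(decomposed.charAt(4))); // 301
    }
}
```

The accent is still present after this step; it only becomes removable because it now sits in its own non-ASCII character.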
NFD splits é into a plain e followed by a separate combining accent, so the accent can then be dropped by itself. If you are only after ASCII text, then once normalised you can filter the string you get from Tika with a regular expression:

extractedText = normalisedText.replaceAll("[^\\p{ASCII}]", "");

However, since regular expressions can be slow (particularly on large strings), you may want to avoid the regex and do a simple character-by-character filter instead, as suggested in another answer:

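public static String flattenToAscii(String string) {
    String normalized = Normalizer.normalize(string, Normalizer.Form.NFD);
    // Size the buffer from the normalised string, which can be longer than the input
    char[] out = new char[normalized.length()];
    int j = 0;
    for (int i = 0, n = normalized.length(); i < n; ++i) {
        char c = normalized.charAt(i);
        // Keep only ASCII characters (U+0000 to U+007F)
        if (c <= '\u007F') out[j++] = c;
    }
    // Copy only the j characters actually written, not the unused tail of the buffer
    return new String(out, 0, j);
}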
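
As a rough, self-contained check of the two approaches, the sketch below runs both on a sample string. The class and method names (`AsciiFilter`, `stripNonAsciiRegex`, `stripNonAsciiLoop`, `keepAlnumAndSpace`) are hypothetical. Note that @ and & are themselves ASCII, so an ASCII-only filter keeps them; `keepAlnumAndSpace` is one possible stricter whitelist, assuming you want to keep only ASCII letters, digits and whitespace, and is not part of the answer above.

```java
import java.text.Normalizer;

public class AsciiFilter {
    // Regex variant: decompose with NFD, then drop everything outside ASCII.
    public static String stripNonAsciiRegex(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("[^\\p{ASCII}]", "");
    }

    // Loop variant: same result, without a regular expression.
    public static String stripNonAsciiLoop(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(normalized.length());
        for (int i = 0; i < normalized.length(); i++) {
            char c = normalized.charAt(i);
            if (c <= '\u007F') out.append(c);   // keep U+0000 to U+007F only
        }
        return out.toString();
    }

    // Assumption: "special characters" also covers ASCII punctuation such as @ and &.
    // Keeps only ASCII letters, digits and whitespace.
    public static String keepAlnumAndSpace(String s) {
        return stripNonAsciiRegex(s).replaceAll("[^\\p{Alnum}\\s]", "");
    }

    public static void main(String[] args) {
        // "Café £5 & Acme™ @ home", written with escapes to avoid source-encoding issues
        String sample = "Caf\u00E9 \u00A35 & Acme\u2122 @ home";

        System.out.println(stripNonAsciiRegex(sample)); // Cafe 5 & Acme @ home
        System.out.println(stripNonAsciiLoop(sample));  // Cafe 5 & Acme @ home
        System.out.println(keepAlnumAndSpace(sample));  // Cafe 5  Acme  home
    }
}
```

The stricter variant leaves double spaces where symbols were removed; collapsing them (for example with `replaceAll("\\s+", " ")`) is a separate choice that depends on how the extracted text is used afterwards.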