Java: Apache Tika parser throws an untraceable exception


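I am currently developing a tool that uses the Apache Tika parser to extract content from different kinds of files. In most cases everything works, but for some files Tika throws the following exception: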

Mar 09, 2020 11:21:58 AM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$€-2]\ * "-"_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$€-2]\ * "-"_)' with c2: null
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:167)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:343)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:901)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:873)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:143)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell(ExcelExtractor.java:673)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:447)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:340)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:92)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.processRecord(ExcelExtractor.java:666)
        at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:109)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:178)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:135)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:316)
        at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at attproc.processors.AttachmentProcessor.run(AttachmentProcessor.java:68)
        at attproc.Main.lambda$main$0(Main.java:89)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)
"tikaConfig" is a singleton object:

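import java.io.InputStream;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;

public class TikaConfiguration {
    private final TikaConfig tikaConfig;
    public final PDFParserConfig pdfConfig;
    public final Parser autoDetectParser;

    private static TikaConfiguration instance;

    private TikaConfiguration() throws Exception {
        ClassLoader classLoader = getClass().getClassLoader();
        // Load tikaconfig.xml from the classpath and close the stream afterwards.
        try (InputStream stream = classLoader.getResourceAsStream("tikaconfig.xml")) {
            this.tikaConfig = new TikaConfig(stream);
        }
        this.pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);

        tikaConfig.getDetector();
        autoDetectParser = new AutoDetectParser(tikaConfig);
    }

    // Lazily creates the singleton; synchronized so concurrent callers share one instance.
    public static synchronized TikaConfiguration setConfiguration() {
        if (TikaConfiguration.instance == null) {
            try {
                TikaConfiguration.instance = new TikaConfiguration();
            } catch (Exception ignored) {
                // NOTE: a failure here is swallowed, so callers may receive null.
            }
        }

        return TikaConfiguration.instance;
    }
}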
What do I have to do to catch this exception?

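Take a look at this old thread. What you are seeing looks very similar. It indicates that the POI library, which Tika uses to parse Excel files, is emitting a warning rather than an error (your log output reflects this as well). The warning happens to include a stack trace in its log message (I assume the exception is caught by POI and then passed on to Tika).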

So your code will not catch that warning, because it is not a thrown exception.
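
To illustrate the difference, here is a minimal sketch that does not use Tika or POI at all. The class `LoggedNotThrown`, the method `formatCell`, and the logger name `demo.format` are hypothetical stand-ins. The method catches an `IllegalArgumentException` internally and hands it to a `java.util.logging` logger, which I assume mirrors what POI does here. The caller's `catch` block never runs, but a `Handler` attached to the logger does see the exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LoggedNotThrown {
    // Hypothetical logger name, used only for this demonstration.
    static final Logger LOG = Logger.getLogger("demo.format");

    // Stand-in for library code: the exception is caught and logged, never rethrown.
    static String formatCell(String format) {
        try {
            throw new IllegalArgumentException("Unsupported format: " + format);
        } catch (IllegalArgumentException e) {
            LOG.log(Level.WARNING, "Invalid format: " + format, e);
            return "";
        }
    }

    // Handler that remembers every exception passed to the logger.
    static class CollectingHandler extends Handler {
        final List<Throwable> seen = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (record.getThrown() != null) {
                seen.add(record.getThrown());
            }
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler();
        LOG.addHandler(handler);

        boolean caught = false;
        try {
            formatCell("[$x-2]");
        } catch (IllegalArgumentException e) {
            caught = true; // never reached: nothing propagates to the caller
        }

        System.out.println("caught by caller: " + caught);
        System.out.println("seen by handler: " + handler.seen.size());
    }
}
```

If you need to react to such warnings rather than just hide them, a handler like this on the relevant logger is one option, assuming the library really logs through `java.util.logging`.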

As one commenter notes in the JIRA:

I'm not even sure this is a bug. This is output from the POILogger, not from, e.g., printStackTrace()

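Whether or not it counts as a bug, the JIRA also suggests a workaround: redirect the err stream to null when running the application (an example is provided there).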
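
The JIRA example is not reproduced here. As an in-process variant of the same idea, the sketch below replaces `System.err` with a stream that discards everything. `QuietStderr` and `discardStderr` are hypothetical names. One assumption worth stating: the `ConsoleHandler` in `java.util.logging` picks up `System.err` when it is created, so the call would probably need to run at the very start of `main`, before anything has been logged. It also hides every other message written to stderr.

```java
import java.io.OutputStream;
import java.io.PrintStream;

public class QuietStderr {
    // Replaces System.err with a sink and returns the previous stream
    // so that it can be restored later.
    static PrintStream discardStderr() {
        PrintStream previous = System.err;
        OutputStream sink = new OutputStream() {
            @Override
            public void write(int b) {
                // discard every byte
            }
        };
        System.setErr(new PrintStream(sink));
        return previous;
    }

    public static void main(String[] args) {
        PrintStream original = discardStderr();
        System.err.println("this line is discarded");

        System.setErr(original);
        System.err.println("stderr restored");
    }
}
```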

I downloaded the spreadsheet attached to the JIRA and was able to recreate a version of your message:

WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)' with c2: null
    at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
    at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
    at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
    at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
...

However, my program completed successfully. It went on to produce its output correctly.

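Sanity check: the error is raised while reading an Excel file (the "Apache POI" SS stack trace), yet there is a parser that references, at least by name, a PDF configuration object (PDFParserConfig()). Is that intentional? I would have expected the process to already have separate handlers for Excel files and PDF files. I may be missing something basic, but it looks odd.

I set the PDFConfig explicitly because I enable OCR separately for each PDF document. But even without this specific configuration, Tika shows the same behavior when my file is somehow broken or corrupted. After some trial and error, I found a solution: I use the line "LogManager.getLogManager().reset();" in my main method to disable logging completely.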
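
Resetting the LogManager silences every logger in the application. A narrower alternative, sketched below, turns off only the loggers under `org.apache.poi.ss.format`. That namespace is an assumption inferred from the class name in the log output, and it only helps if POI emits the warning through `java.util.logging`, which the effect of the reset suggests. `QuietPoiFormatLogging` and `silence` are hypothetical names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPoiFormatLogging {
    // Assumed logger namespace, inferred from the class name in the log output.
    static final String POI_FORMAT_LOGGERS = "org.apache.poi.ss.format";

    // Keep a strong reference: java.util.logging holds loggers weakly, and a
    // garbage-collected logger would lose the level set below.
    private static final Logger POI_FORMAT = Logger.getLogger(POI_FORMAT_LOGGERS);

    // Turns off this logger and every child logger without an explicit level.
    static void silence() {
        POI_FORMAT.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        Logger child = Logger.getLogger(POI_FORMAT_LOGGERS + ".CellFormat");
        // Child loggers inherit the effective level from their parent.
        System.out.println(child.isLoggable(Level.WARNING));
    }
}
```

Calling `silence()` once at startup leaves the application's own loggers untouched, so other warnings remain visible.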