Java: Apache Tika parser throws an uncatchable exception
I am currently trying to develop a tool that uses the Apache Tika parser to extract content from different files. In most cases everything works fine, but for some files Tika throws the following exception:
Mar 09, 2020 11:21:58 AM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$€-2]\ * "-"_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$€-2]\ * "-"_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:167)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:343)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:901)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:873)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:143)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell(ExcelExtractor.java:673)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:447)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:340)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:92)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.processRecord(ExcelExtractor.java:666)
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:109)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:178)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:135)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:316)
at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at attproc.processors.AttachmentProcessor.run(AttachmentProcessor.java:68)
at attproc.Main.lambda$main$0(Main.java:89)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
"tikaConfig" is a singleton object:
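import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;

public class TikaConfiguration {
    private final TikaConfig tikaConfig;
    public final PDFParserConfig pdfConfig;
    public final Parser autoDetectParser;
    private static TikaConfiguration instance;

    private TikaConfiguration() throws Exception {
        // Load the Tika configuration from the classpath resource.
        ClassLoader classLoader = getClass().getClassLoader();
        InputStream stream = classLoader.getResourceAsStream("tikaconfig.xml");
        this.tikaConfig = new TikaConfig(stream);
        this.pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);
        tikaConfig.getDetector(); // return value is unused
        autoDetectParser = new AutoDetectParser(tikaConfig);
    }

    // Lazily creates the singleton; synchronized because it is called from worker threads.
    public static synchronized TikaConfiguration setConfiguration() {
        if (TikaConfiguration.instance == null) {
            try {
                TikaConfiguration.instance = new TikaConfiguration();
            } catch (Exception e) {
                // Fail loudly instead of silently returning null.
                throw new IllegalStateException("Could not load tikaconfig.xml", e);
            }
        }
        return TikaConfiguration.instance;
    }
}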
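What do I have to do to catch this exception?
Take a look at the older threads on this issue. What you are seeing looks very similar. It indicates that the POI library, which Tika uses to parse Excel files, is logging a warning, not throwing an error (your log output reflects this as well). The warning just happens to include a stack trace in its log output (I assume the exception is caught by POI and then passed on to Tika).
Therefore, your code will not catch that warning (it is not a thrown exception).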
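To illustrate the difference, here is a minimal sketch that uses java.util.logging directly and does not call POI or Tika at all. The class name LoggedNotThrown and the method parseLikePoi are hypothetical stand-ins. The library-style method catches its own exception and hands it to a logger, so a stack trace is printed while the caller's catch block never runs.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoggedNotThrown {
    private static final Logger LOG = Logger.getLogger("demo.library");

    // Hypothetical stand-in for library code: the failure is caught
    // internally and only reported through the logger.
    static String parseLikePoi(String format) {
        try {
            throw new IllegalArgumentException("Unsupported format: " + format);
        } catch (IllegalArgumentException e) {
            // Passing the exception as the last argument makes the
            // handler print its stack trace along with the message.
            LOG.log(Level.WARNING, "Invalid format: " + format, e);
            return "fallback";
        }
    }

    // Returns true only if an exception actually reached the caller.
    static boolean callerSawException() {
        try {
            parseLikePoi("[$x-2]");
            return false;
        } catch (RuntimeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Prints "caller caught an exception: false"
        System.out.println("caller caught an exception: " + callerSawException());
    }
}
```

Running this prints a WARNING record with a full stack trace to the console, followed by the line reporting that nothing was caught.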
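As one commenter mentions in the JIRA:
I'm not even sure that this is a bug. This is the output of a POILogger, not of, e.g., printStackTrace()
Regardless of whether it counts as a bug, the JIRA also suggests a workaround: redirect the err stream to null when running the application (an example is provided there).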
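The example from the JIRA is not reproduced here. As a rough in-process variant of the same idea, the sketch below replaces System.err with a stream that discards everything. The class and method names are hypothetical, and OutputStream.nullOutputStream() assumes Java 11 or later. As far as I know, the java.util.logging console handler keeps whatever stream System.err pointed to when the handler was created, so this would have to run early in main, before anything is logged. It also hides every other message written to stderr.

```java
import java.io.OutputStream;
import java.io.PrintStream;

final class SilenceStderr {
    // Replaces System.err with a stream that discards all output and
    // returns the previous stream so the caller can restore it later.
    static PrintStream discardStderr() {
        PrintStream previous = System.err;
        System.setErr(new PrintStream(OutputStream.nullOutputStream()));
        return previous;
    }

    public static void main(String[] args) {
        PrintStream previous = discardStderr();
        System.err.println("this line is discarded");
        System.setErr(previous);
        System.err.println("stderr restored");
    }
}
```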
I downloaded the spreadsheet attached to the JIRA, and I was able to reproduce a version of your message:
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
...
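However, my program completes successfully. It goes on to generate its output correctly.
Sanity check: the error is raised while reading an Excel file (the "Apache POI" SS stack trace), and yet there is a parser that refers (by name, at least) to a PDF configuration object (PDFParserConfig()). Is this intentional? I would have expected the process to already have separate handlers for Excel files and PDF files. I may be missing something basic, but this looks odd.
I set the PDFConfig explicitly because I enable OCR separately for each PDF document. But even without this specific configuration, Tika emits this output when my file is somehow broken or corrupted. After some trial and error I found a solution: I use the line "LogManager.getLogManager().reset();" in my main method to disable logging completely.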
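LogManager.getLogManager().reset() removes the handlers from every named logger, so it also silences log output unrelated to POI. A narrower alternative might be to switch off only the loggers that produce the warning. The sketch below assumes the warnings come from java.util.logging loggers whose names start with "org.apache.poi.ss.format", which the log header above suggests but does not prove. The class name PoiLogSilencer is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class PoiLogSilencer {
    // Assumption: the POI cell-format warnings are logged through
    // java.util.logging loggers named under this package.
    static final String POI_FORMAT_LOGGERS = "org.apache.poi.ss.format";

    // Keep a strong reference: the LogManager holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger POI_FORMAT_LOGGER =
            Logger.getLogger(POI_FORMAT_LOGGERS);

    // Turns off this logger; child loggers without their own level inherit it.
    static void silence() {
        POI_FORMAT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        Logger child = Logger.getLogger(POI_FORMAT_LOGGERS + ".CellFormat");
        Logger other = Logger.getLogger("attproc.Main");
        System.out.println("POI format warnings enabled: "
                + child.isLoggable(Level.WARNING));
        System.out.println("other warnings enabled: "
                + other.isLoggable(Level.WARNING));
    }
}
```

Calling PoiLogSilencer.silence() once at startup, before the first file is parsed, would leave the application's own loggers untouched.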