C#: How to avoid System.OutOfMemoryException in TikaOnDotnet.TextExtractor
I am using TikaOnDotnet.TextExtractor to extract text from various types of files. It runs as a console application on Windows 10 (x64). For some files, however, it throws a System.OutOfMemoryException. Here is sample code:
using System;
using TikaOnDotNet.TextExtraction;

namespace TikaRnD
{
    class Program
    {
        static void Main(string[] args)
        {
            // These two typeof references are otherwise unused; presumably they are there
            // so that the corresponding IKVM OpenJDK assemblies are referenced and loaded.
            Type IKVM_OpenJDK_A = typeof(com.sun.codemodel.@internal.ClassType);
            Type IKVM_OpenJDK_B = typeof(com.sun.org.apache.xalan.@internal.xsltc.trax.TransformerFactoryImpl);

            var textExtractor = new TextExtractor();
            try
            {
                // Extract the text and print only its length.
                var teResult = textExtractor.Extract(@"c:\Temp\Largefile.docx");
                Console.WriteLine(teResult.Text.Length);
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
        }
    }
}
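Largefile.docx is a document of about 6 MB that contains a lot of text and embedded images. When I run the program, I can see the process consuming more and more system memory. 4 GB of RAM is not enough, and the run ends with this exception: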
TikaOnDotNet.TextExtraction.TextExtractionException: Extraction of text from the file 'c:\Temp\TestData\Largefile.docx' failed. ---> TikaOnDotNet.TextExtraction.TextExtractionException: Extraction failed. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at org.apache.poi.extractor.ExtractorFactory.createExtractor(OPCPackage pkg)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 31
--- End of inner exception stack trace ---
at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 43
at TikaOnDotNet.TextExtraction.TextExtractor.Extract(Func`2 streamFactory) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 53
at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 19
--- End of inner exception stack trace ---
at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 28
at TikaRnD.Program.Main(String[] args) in c:\users\norbert\source\repos\TikaRnD\TikaRnD\Program.cs:line 20
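When I run the sample code with the same file on a system with more memory, it consumes about 10 GB of RAM and completes the extraction successfully. The extracted content is about 50 MB in size.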
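Can anyone help me understand why memory consumption is so high, and how to prevent it if possible?

Comment: Set the parser context object to switch to the streaming xlsx reader?

Answer: This is not a solution, but a hint as to why this happens. Quote: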
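PPT/PPTX, DOC/DOCX and PDF are all formats that can only be parsed by building a DOM-like structure in memory, so they need more available memory. XLS/XLSX, and a few others, can largely be processed in a streaming fashion.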
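To make the distinction in the quote concrete, here is a minimal sketch that uses only the standard .NET XML APIs, not Tika or POI. It does not show how Tika parses DOCX internally; the class and method names (DomVsStreaming, CountWithDom, CountWithReader) are hypothetical and exist only for this illustration. Both methods count elements with a given name, but the first builds the whole tree in memory first, while the second reads one node at a time.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration only; not part of TikaOnDotnet or Tika.
static class DomVsStreaming
{
    // DOM style: the entire document is materialized as a tree before
    // anything is counted, so memory use grows with document size.
    public static int CountWithDom(string xml, string elementName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.GetElementsByTagName(elementName).Count;
    }

    // Streaming style: the reader moves forward node by node and nothing
    // is retained except the running count.
    public static int CountWithReader(TextReader input, string elementName)
    {
        int count = 0;
        using (var reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Both methods return the same count for the same input. The difference is only in how much of the document has to be held in memory at once, which is the property the quote attributes to DOC/DOCX versus XLS/XLSX parsing.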