C#: How to avoid System.OutOfMemoryException in TikaOnDotnet.TextExtractor

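I am using TikaOnDotnet.TextExtractor to extract text from various types of files. It runs as a console application on Windows 10 (x64). For some files, however, it throws a System.OutOfMemoryException.

Here is sample code that reproduces the problem: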

using System;
using TikaOnDotNet.TextExtraction;

namespace TikaRnD
{
    class Program
    {
        static void Main(string[] args)
        {
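            // Reference types from two IKVM OpenJDK assemblies; presumably this is here
            // so that those assemblies are copied to the output folder and loaded.
            Type IKVM_OpenJDK_A = typeof(com.sun.codemodel.@internal.ClassType);
            Type IKVM_OpenJDK_B = typeof(com.sun.org.apache.xalan.@internal.xsltc.trax.TransformerFactoryImpl);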

            var textExtractor = new TextExtractor();
            try
            {
                var teResult = textExtractor.Extract(@"c:\Temp\Largefile.docx");
                Console.WriteLine(teResult.Text.Length);
            }
            catch(Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
        }
    }
}
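
Largefile.docx is a document of about 6 MB that contains a large amount of text and embedded images. When I run the program, I can see the process consuming more and more system memory. 4 GB of RAM is not enough, and the run ends with this exception: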

TikaOnDotNet.TextExtraction.TextExtractionException: Extraction of text from the file 'c:\Temp\TestData\Largefile.docx' failed. ---> TikaOnDotNet.TextExtraction.TextExtractionException: Extraction failed. ---> System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at org.apache.poi.extractor.ExtractorFactory.createExtractor(OPCPackage pkg)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.CompositeParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at org.apache.tika.parser.AutoDetectParser.parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
   at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 31
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtraction.Stream.StreamTextExtractor.Extract(Func`2 streamFactory, Stream outputStream) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\Stream\StreamTextExtractor.cs:line 43
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(Func`2 streamFactory) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 53
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 19
   --- End of inner exception stack trace ---
   at TikaOnDotNet.TextExtraction.TextExtractor.Extract(String filePath) in C:\projects\tikaondotnet\src\TikaOnDotnet.TextExtractor\TextExtractor.cs:line 28
   at TikaRnD.Program.Main(String[] args) in c:\users\norbert\source\repos\TikaRnD\TikaRnD\Program.cs:line 20
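
When I run the sample code with the same file on a system with more memory, it consumes about 10 GB of RAM and the extraction completes successfully. The extracted content is about 50 MB in size.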


Can anyone help me understand why the memory consumption is so high, and how to prevent it if possible?

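Could you set a parse context object to switch to the streaming xlsx reader?

This is not a solution, but a hint as to why this happens:

"PPT/PPTX, DOC/DOCX and PDF are all formats that can only be parsed by building a DOM-like structure in memory, so they need more available memory. XLS/XLSX, along with a few others, can largely be processed in a streaming manner."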
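
To make the difference between a DOM-like structure and streaming concrete, here is a minimal sketch using the standard .NET XML classes on plain XML. It is only an illustration of the two parsing styles and is not how Tika or POI read a DOCX internally; the class name XmlParsingDemo and both method names are made up for this example.

```csharp
using System;
using System.IO;
using System.Xml;

// Illustrative only: contrasts DOM-style and streaming parsing on plain XML.
// XmlParsingDemo and its method names are hypothetical; this is not Tika or POI code.
public static class XmlParsingDemo
{
    // DOM-style: the whole document is materialized as a node tree before
    // anything can be counted, so the tree's memory grows with document size.
    public static int CountElementsDom(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes("//*").Count;
    }

    // Streaming: the reader exposes one node at a time and keeps no tree,
    // so the parser's own memory use does not grow with document size.
    public static int CountElementsStreaming(string xml)
    {
        int count = 0;
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Both methods return the same count for the same input; only the amount of state held during parsing differs. If the quoted statement is right, a DOCX parser is stuck in the first style, which would be consistent with memory use that is many times larger than the 6 MB file.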