C#: Why does my page content get mixed up when reading a PDF?

c# .net itext7 text-extraction

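I'm using iText7 to read the text from a PDF file. This works fine for the first page. After that, the content of the pages somehow gets mixed up: on page 3 of the document I get lines that contain content from both page 1 and page 3. The text of page 2 shows exactly the same lines as page 1 (although in reality the two pages are completely different).

Page 1, actual: ~36 lines, result: 36 lines -> great
Page 2, actual: >50 lines, result: 36 lines (= page 1)
Page 3, actual: ~16 lines, result: 47 lines (lines of page 1 added and mixed in)

I use the following code to read the document: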

using System;
using System.Collections.Generic;
using System.Linq;

namespace StockMarket
{
    class PdfReader
    {
        /// <summary>
        /// Reads PDF file by a given path.
        /// </summary>
        /// <param name="path">The path to the file</param>
        /// <param name="pageCount">The number of pages to read (0=all, 1 by default) </param>
        /// <returns></returns>
        public static DocumentTree PdfToText(string path, int pageCount=1 )
        {
            var pages = new DocumentTree();
            using (iText.Kernel.Pdf.PdfReader reader = new iText.Kernel.Pdf.PdfReader(path))
            {
                using (iText.Kernel.Pdf.PdfDocument pdfDocument = new iText.Kernel.Pdf.PdfDocument(reader))
                {
                    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

                    // set up pages to read
                    int pagesToRead = 1;
                    if (pageCount > 0)
                    {
                        pagesToRead = pageCount;
                    }
                    if (pagesToRead > pdfDocument.GetNumberOfPages() || pageCount==0)
                    {
                        pagesToRead = pdfDocument.GetNumberOfPages();
                    }

                    // for each page to read...
                    for (int i = 1; i <= pagesToRead; ++i)
                    {
                        // get the page and save it
                        var page = pdfDocument.GetPage(i);
                        var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
                        pages.Add(txt);
                    }
                    pdfDocument.Close();
                    reader.Close();
                }
            }
            return pages;
        }

    }

    /// <summary>
    /// A class representing parts of a PDF document.
    /// </summary>
    class DocumentTree
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public DocumentTree()
        {
            Pages = new List<DocumentPage>();
        }

        private List<DocumentPage> _pages;
        /// <summary>
        /// The pages of the document
        /// </summary>
        public List<DocumentPage> Pages
        {
            get { return _pages; }
            set { _pages = value; }
        }

        /// <summary>
        /// Adds a <see cref="DocumentPage"/> to the document.
        /// </summary>
        /// <param name="page">The text of the <see cref="DocumentPage"/>.</param>
        public void Add(string page)
        {
            Pages.Add(new DocumentPage(page));
        }
    }

    /// <summary>
    /// A class representing a single page of a document
    /// </summary>
    class DocumentPage
    {
        /// <summary>
        /// Constructor
        /// </summary>
        /// <param name="pageContent">The pages content as text</param>
        public DocumentPage(string pageContent)
        {
            // set the content to the input
            CompletePage = pageContent;

            // split the content by lines
            var splitter = new string[] { "\n" };
            foreach (var line in CompletePage.Split(splitter, StringSplitOptions.None))
            {
                // add lines to the page if the content is not empty
                if (!string.IsNullOrWhiteSpace(line))
                {                    
                    _lines.Add(new Line(line));
                }
            }

        }

        private List<Line> _lines = new List<Line>();
        /// <summary>
        /// The lines of text of the <see cref="DocumentPage"/>
        /// </summary>
        public List<Line> Lines
        {
            get
            {
                return _lines;
            }            
        }

        /// <summary>
        /// The text of the complete <see cref="DocumentPage"/>.
        /// </summary>
        private string CompletePage;
    }

    /// <summary>
    /// A class representing a single line of text
    /// </summary>
    class Line
    {
        /// <summary>
        /// Constructor
        /// </summary>
        public Line(string lineContent)
        {
            CompleteLine = lineContent;
        }

        /// <summary>
        /// The words of the <see cref="Line"/>.
        /// </summary>
        public List<string> Words
        {
            get
            {
                return CompleteLine.Split(" ".ToArray()).Where((word)=> { return !string.IsNullOrWhiteSpace(word); }).ToList();
            }
        }

        /// <summary>
        /// The complete text of the <see cref="Line"/>.
        /// </summary>
        private string CompleteLine;

        public override string ToString()
        {
            return CompleteLine;
        }
    }
}

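The DocumentTree is a simple tree containing pages, which consist of lines (the page text split by "\n"), which in turn consist of words (the line split by " "). However, the txt variable in the loop already contains the mixed-up content, so my tree is not causing the problem.

Thanks for your help.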

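Some parse event listeners, in particular most text extraction strategies, are not meant to be reused across multiple pages. Instead, you should create a new instance for each page.

As a rule of thumb, every such listener that collects information while a page is parsed and afterwards lets you access that data (just as the text extraction strategies let you access the collected page text) most likely has to be instantiated separately for each page if you don't want the data of all pages to accumulate.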
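To see why reuse leads to accumulation, here is a minimal sketch that does not use iText at all. AccumulatingListener and ListenerDemo are hypothetical stand-ins, written under the assumption that a text extraction strategy behaves like a listener that appends every chunk it receives to an internal buffer and never clears it:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a stateful text extraction strategy (not an iText type).
class AccumulatingListener
{
    private readonly StringBuilder _buffer = new StringBuilder();

    // Called once per text chunk while a page is parsed.
    public void OnText(string chunk) { _buffer.AppendLine(chunk); }

    // Returns everything collected so far; the buffer is never cleared.
    public string GetResultantText() { return _buffer.ToString(); }
}

static class ListenerDemo
{
    // Feeds all chunks of one page to the listener and returns the listener's text.
    static string Parse(IEnumerable<string> pageChunks, AccumulatingListener listener)
    {
        foreach (var chunk in pageChunks) listener.OnText(chunk);
        return listener.GetResultantText();
    }

    // One listener shared by all pages: later results still contain earlier pages.
    public static List<string> ExtractShared(List<List<string>> pages)
    {
        var listener = new AccumulatingListener();
        var result = new List<string>();
        foreach (var page in pages) result.Add(Parse(page, listener));
        return result;
    }

    // A fresh listener per page: each result contains only that page.
    public static List<string> ExtractFresh(List<List<string>> pages)
    {
        var result = new List<string>();
        foreach (var page in pages) result.Add(Parse(page, new AccumulatingListener()));
        return result;
    }
}
```

For two pages with the chunks "p1" and "p2", the second entry returned by ExtractShared contains both "p1" and "p2", while the second entry returned by ExtractFresh contains only "p2". A location-based strategy presumably also sorts the accumulated chunks by position, which would explain why lines from page 1 show up interleaved with those of page 3 rather than simply prepended.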

Thus, in your code, move the strategy instantiation

var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();

into the for loop:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
    pages.Add(txt);
}

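Alternatively, since you only use the strategy with its default settings, you can call the PdfTextExtractor.GetTextFromPage overload that takes no strategy argument; it creates a new LocationTextExtractionStrategy on each call:

// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
    // get the page and save it
    var page = pdfDocument.GetPage(i);
    var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page);
    pages.Add(txt);
}

Comments:

PDF documents aren't "structured" the way HTML and XML documents are. A PDF is really just a set of drawing instructions, which can appear in any order to produce the final rendered page. You can only read PDF files "correctly" if they are tagged PDFs, and many PDF generators don't tag their output, which makes those files practically impossible to read by machine.

Can you share the PDF in question?

@mkl: done. @Dai: OK, but I think that only explains errors within the text of a single page, not text showing up that isn't on the page at all, does it?

Beware of using itext; their library contains a time bomb. The product is ostensibly dual-licensed, commercial and AGPL, but the authors misread the AGPL as requiring your project to be OSS as well. If you use itext in a private project without a commercial or OSS license key, itext will start spewing nag messages into your logs after a few months... just long enough for you to go to production and think everything is fine.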
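The comment about PDFs being drawing instructions in arbitrary order can be illustrated with a small sketch. TextChunk and LocationOrdering are hypothetical names, and the sorting is a heavy simplification of what a location-based strategy does, not iText's actual algorithm: chunks arrive in content stream order and are put into reading order by sorting top to bottom, then left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical chunk: a piece of text plus the position where it was drawn.
class TextChunk
{
    public string Text;
    public double X;   // distance from the left edge
    public double Y;   // distance from the bottom edge (PDF coordinates grow upwards)

    public TextChunk(string text, double x, double y) { Text = text; X = x; Y = y; }
}

static class LocationOrdering
{
    // Groups chunks by identical Y (a simplification; real baselines need a
    // tolerance), orders the groups top to bottom and each group left to right,
    // then joins every group into one line.
    public static List<string> ToLines(IEnumerable<TextChunk> chunks)
    {
        return chunks
            .GroupBy(c => c.Y)
            .OrderByDescending(g => g.Key)
            .Select(g => string.Join(" ", g.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

Under these assumptions, chunks from different pages have overlapping coordinates, so a listener that still holds the chunks of page 1 would sort them in between the chunks of a later page. That matches the mixed lines described in the question.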