How do I get a specific paragraph from a PDF file using iTextSharp in C#?

c# c#-4.0

I'm using iTextSharp in my C# WinForms application. I want to get a specific paragraph from a PDF file. Is this possible in iTextSharp?

Yes and no.

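First the no. The PDF format has no concept of text structure such as paragraphs, sentences or even words; it just has runs of text. The fact that two lines of text sit close to each other, so that we think of them as structured, is a human thing. When you see something in a PDF that looks like a three-line paragraph, the program that generated the PDF actually chopped the text into three unrelated lines and then drew each one at specific x,y coordinates. Even worse, depending on what the designer wanted, each line of text could be composed of smaller runs, which could be words or even just characters. So it might be: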

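draw "the cat in the hat" at 10,10

or

draw "t" at 10,10, then "h" at 14,10, then "e" at 18,10, and so on. This is very common in PDFs produced by heavily designed programs such as Adobe InDesign.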
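
To make the two cases concrete, here is a minimal sketch that does not depend on iTextSharp. The DrawOp type and the JoinLine helper are hypothetical names made up for this illustration; they model "draw this text at x,y" instructions and show that both variants produce the same visible line once the pieces on one baseline are ordered by x.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical model of a "draw this text at (X, Y)" instruction; not an iTextSharp type.
public class DrawOp {
    public string Text;
    public float X;
    public float Y;
    public DrawOp(string text, float x, float y) { Text = text; X = x; Y = y; }
}

public static class DrawOpDemo {
    // Rebuilds one visual line: keep the ops whose baseline equals y,
    // order them left to right, and concatenate their text.
    public static string JoinLine(IEnumerable<DrawOp> ops, float y) {
        StringBuilder sb = new StringBuilder();
        foreach (DrawOp op in ops.Where(o => o.Y == y).OrderBy(o => o.X)) {
            sb.Append(op.Text);
        }
        return sb.ToString();
    }
}
```

A list holding the single op ("the", 10, 10) and a list holding the three ops ("t", 10, 10), ("h", 14, 10), ("e", 18, 10) in any order both come back from JoinLine as the same string, which is why the structure has to be inferred from coordinates rather than read from the file.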
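Now the yes. Well, actually it's a maybe. If you're willing to put in a little work, you might be able to get iTextSharp to do what you're looking for. There's a class called PdfTextExtractor that has a method called GetTextFromPage, which gets all of the raw text from a page. The last parameter to this method is an object that implements the ITextExtractionStrategy interface. If you create your own class that implements this interface, you can process each run of text and perform your own logic.

In this interface there's a method called RenderText, which is called for every run of text. You're given an iTextSharp.text.pdf.parser.TextRenderInfo object, from which you can get the raw text of the run as well as other things such as the coordinates it starts at, the current font, and so on. Since a visual line of text can be made up of several runs, you can use this method to compare a run's baseline (the y coordinate of its start point) to that of the previous run to determine whether it is part of the same visual line.

Below is a sample implementation of that interface:
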
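    using System;
    using System.Collections.Generic;
    using System.Text;
    using iTextSharp.text.pdf.parser;

    public class TextAsParagraphsExtractionStrategy : iTextSharp.text.pdf.parser.ITextExtractionStrategy {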
        //Text buffer
        private StringBuilder result = new StringBuilder();

        //Store last used properties
        private Vector lastBaseLine;

        //Buffer of lines of text and their Y coordinates. NOTE, these should be exposed as properties instead of fields but are left as is for simplicity's sake
        public List<string> strings = new List<String>();
        public List<float> baselines = new List<float>();

        //This is called whenever a run of text is encountered
        public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) {
            //This code assumes that if the baseline changes then we're on a newline
            Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();

            //See if the baseline has changed
            if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2])) {
                //See if we have text and not just whitespace
                if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                    //Mark the previous line as done by adding it to our buffers
                    this.baselines.Add(this.lastBaseLine[Vector.I2]);
                    this.strings.Add(this.result.ToString());
                }
                //Reset our "line" buffer
                this.result.Clear();
            }

            //Append the current text to our line buffer
            this.result.Append(renderInfo.GetText());

            //Reset the last used line
            this.lastBaseLine = curBaseline;
        }

        public string GetResultantText() {
            //One last time, see if there's anything left in the buffer
            if ((!String.IsNullOrWhiteSpace(this.result.ToString()))) {
                this.baselines.Add(this.lastBaseLine[Vector.I2]);
                this.strings.Add(this.result.ToString());
            }
            //We're not going to use this method to return a string; instead, callers should inspect this class's strings and baselines fields afterwards.
            return null;
        }

        //Not needed, part of interface contract
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }
    }
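
The strategy above compares baselines with !=, so any difference in the float Y values counts as a new line. If a particular PDF positions runs with tiny rounding differences, a tolerance-based check might be more robust. The helper below is only a sketch: the name BaselineComparer and the default tolerance of 0.5 units are assumptions for illustration, not part of iTextSharp, and the right tolerance depends on the document.

```csharp
using System;

public static class BaselineComparer {
    // Returns true when two baseline Y coordinates lie within the given tolerance of each other.
    // The 0.5 default is an assumed value; tune it for the documents you process.
    public static bool SameLine(float y1, float y2, float tolerance = 0.5f) {
        return Math.Abs(y1 - y2) <= tolerance;
    }
}
```

In RenderText, the baseline test could then be written as !BaselineComparer.SameLine(curBaseline[Vector.I2], lastBaseLine[Vector.I2]) in place of the direct != comparison.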

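Comment: Great explanation. I tried this code to build a paragraph, but knowing the coordinate positions doesn't help me, because text can be aligned anywhere in the PDF. Still, a great explanation, thanks.

To use the strategy, pass an instance to GetTextFromPage and then read its strings and baselines buffers (workingFile is the path to the PDF):

        PdfReader reader = new PdfReader(workingFile);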
        TextAsParagraphsExtractionStrategy S = new TextAsParagraphsExtractionStrategy();
        iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1, S);
        for (int i = 0; i < S.strings.Count; i++) {
            Console.WriteLine("Line {0,-5}: {1}", S.baselines[i], S.strings[i]);
        }
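
Once the lines and their baselines are available, one possible way to get closer to a specific paragraph is to group consecutive lines whose baselines are close together. The sketch below assumes that paragraphs are separated by more vertical space than the lines inside a paragraph; ParagraphGrouper and the maxLineGap parameter are hypothetical names, and the threshold has to be chosen per document. It operates on plain lists, so it does not depend on iTextSharp itself.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphGrouper {
    // Groups extracted lines into paragraphs. A new paragraph starts whenever the vertical
    // distance to the previous baseline exceeds maxLineGap (an assumed, document-specific value).
    public static List<string> Group(IList<string> lines, IList<float> baselines, float maxLineGap) {
        if (lines.Count != baselines.Count) {
            throw new ArgumentException("lines and baselines must have the same length");
        }
        List<string> paragraphs = new List<string>();
        List<string> current = new List<string>();
        for (int i = 0; i < lines.Count; i++) {
            // A large jump between consecutive baselines closes the paragraph collected so far
            if (i > 0 && Math.Abs(baselines[i - 1] - baselines[i]) > maxLineGap) {
                paragraphs.Add(string.Join(" ", current));
                current.Clear();
            }
            current.Add(lines[i].Trim());
        }
        if (current.Count > 0) {
            paragraphs.Add(string.Join(" ", current));
        }
        return paragraphs;
    }
}
```

With the strategy instance S from above, ParagraphGrouper.Group(S.strings, S.baselines, 20f) would return the grouped text, and indexing the result picks one paragraph. The heuristic cannot separate paragraphs that are set with the same spacing as ordinary lines, which may well be the case for files produced with default Paragraph settings, and it does nothing for the alignment problem raised in the comment.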

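A sample file to test against can be created as follows. Run this before the extraction code, since FileMode.Create overwrites whatever is at workingFile:

        using (FileStream fs = new FileStream(workingFile, FileMode.Create, FileAccess.Write, FileShare.None)) {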
            using (Document doc = new Document(PageSize.LETTER)) {
                using (PdfWriter writer = PdfWriter.GetInstance(doc, fs)) {
                    doc.Open();
                    doc.Add(new Paragraph("Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna."));
                    doc.Add(new Paragraph("This"));
                    doc.Add(new Paragraph("Is"));
                    doc.Add(new Paragraph("A"));
                    doc.Add(new Paragraph("Test"));
                    doc.Close();
                }
            }
        }