C#: Parse a PDF file into memory and search for a specific value

c# itext

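I'm still fairly new to C#, and I'm trying to learn it in a more practical way to build interest and understanding. I have code that parses a PDF file, and it works fine. However, I'd like to write the text to memory instead of the console, so that I can search it for the InvoiceNumber later.

My current code, which writes to the console: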

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace PDF_file_reader
{
    class Program
    {
        static void Main(string[] args)
        {

            List<int> InvoiceNumbers = new List<int>();

            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            int pagesToScan = 2;

            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);

                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        {
                            //Console.WriteLine($"<{line}>");
                            Console.WriteLine(line.ToString());
                        }
                    }

                    Console.Read();
                }

            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}

Just a note: you have an extra set of { } inside your foreach loop. It can be removed.

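If you want to store the whole invoice number highlighted in the screenshot ("INV-3337" rather than "3337"), InvoiceNumbers needs to be a list of strings, not a list of ints.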
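
To illustrate why the element type matters, here is a minimal sketch. The class and method names (InvoiceNumberTypes, FitsInIntList, StoreAsText) are made up for this example and are not part of the original code.

```csharp
using System.Collections.Generic;

public static class InvoiceNumberTypes
{
    // True only when the text is a plain integer, i.e. something a List<int> could hold.
    public static bool FitsInIntList(string invoiceNumber)
    {
        int ignored;
        return int.TryParse(invoiceNumber, out ignored);
    }

    // A List<string> keeps the full value, prefix included.
    public static List<string> StoreAsText(string invoiceNumber)
    {
        var invoiceNumbers = new List<string>();
        invoiceNumbers.Add(invoiceNumber);
        return invoiceNumbers;
    }
}
```

FitsInIntList("INV-3337") returns false, while FitsInIntList("3337") returns true, so keeping the "INV-" prefix means declaring the list as List<string>.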

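I'm assuming the invoices always look the same, or at least that the number is always in the same format (i.e. "Invoice Number INV-#######"). In that case you can add a line inside the foreach loop. Since each line is a string, you can check whether line contains "Invoice Number". If it does, add it to InvoiceNumbers with the phrase "Invoice Number" removed, then trim it to get rid of any whitespace. This can go either above or below Console.WriteLine(line.ToString());

All you need to add is: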

if (line.Contains("Invoice Number"))
    InvoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());

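(I used Replace() instead of Remove() because with Remove() you need to know where the phrase you want to remove starts and ends. In my opinion, Replace() is the safest approach in this particular case.)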
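
For comparison, here is a rough sketch of both approaches side by side. The PhraseRemoval class and its method names are invented for illustration. Remove() needs the start index and the length of the phrase, which is why it takes an extra IndexOf() call.

```csharp
using System;

public static class PhraseRemoval
{
    public const string Label = "Invoice Number";

    // Replace() finds the phrase by itself and removes every occurrence of it.
    public static string WithReplace(string line)
    {
        return line.Replace(Label, "").Trim();
    }

    // Remove() works on positions, so the phrase has to be located first.
    public static string WithRemove(string line)
    {
        int start = line.IndexOf(Label, StringComparison.Ordinal);
        if (start < 0)
        {
            return line.Trim(); // phrase not present, nothing to cut out
        }
        return line.Remove(start, Label.Length).Trim();
    }
}
```

For a line such as "Invoice Number INV-3337", both methods give "INV-3337". The Replace() version is simply shorter and has no index arithmetic to get wrong.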

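You can also add break; to the if statement (if that's all you're looking for). This stops the foreach loop over the lines of the current page. Once the invoice number has been extracted, there's no reason to look through the rest of the document, unless you have multiple invoices in one document.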

if (line.Contains("Invoice Number"))
{
    InvoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());
    break;
}
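
One thing to keep in mind: break only leaves the innermost loop. The sketch below shows this with plain string arrays standing in for the extracted page text. InvoiceScanner, FirstPerPage and the pages parameter are hypothetical names, and no PDF library is involved.

```csharp
using System.Collections.Generic;

public static class InvoiceScanner
{
    // pages: one string[] per page, already split into lines
    // (an assumed stand-in for the text extracted from the PDF).
    public static List<string> FirstPerPage(IEnumerable<string[]> pages)
    {
        var invoiceNumbers = new List<string>();
        foreach (string[] lines in pages)
        {
            foreach (string line in lines)
            {
                if (line.Contains("Invoice Number"))
                {
                    invoiceNumbers.Add(line.Replace("Invoice Number", "").Trim());
                    break; // exits the line loop only; the next page is still scanned
                }
            }
        }
        return invoiceNumbers;
    }
}
```

With this shape you get at most one invoice number per page. If a single number per document is enough, returning from the method at that point instead of using break would stop the page loop as well.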

Having the invoice numbers in a list should help if you want to search for a specific one.
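
A lookup against that list could look roughly like this. InvoiceLookup and HasInvoice are names made up for the example, and the case-insensitive comparison is an assumption about what is wanted.

```csharp
using System;
using System.Collections.Generic;

public static class InvoiceLookup
{
    // True if the list holds the wanted invoice number, ignoring letter case.
    public static bool HasInvoice(List<string> invoiceNumbers, string wanted)
    {
        return invoiceNumbers.Exists(
            n => string.Equals(n, wanted, StringComparison.OrdinalIgnoreCase));
    }
}
```

If an exact, case-sensitive match is fine, invoiceNumbers.Contains("INV-3337") does the same job in one call.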

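This assumes that the only thing that differs is the actual number. If that's not the case, you can always have it look for a pattern like "INV-\d*" instead. That would also assume the invoice number format is always the same.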
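
A pattern-based version might look like the sketch below. It uses "INV-\d+" rather than "INV-\d*", because \d* would also match a bare "INV-" with no digits after it. InvoicePattern and FindAll are illustrative names only.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InvoicePattern
{
    // "INV-" followed by at least one digit.
    private static readonly Regex Pattern = new Regex(@"INV-\d+");

    // Returns every invoice number found in the text, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var found = new List<string>();
        foreach (Match match in Pattern.Matches(text))
        {
            found.Add(match.Value);
        }
        return found;
    }
}
```

Because the pattern does not depend on the "Invoice Number" label, FindAll can be run on the whole page text (strText in the question's code) without splitting it into lines first.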