C#: Read a PDF and find a specific column to add to a list


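Can anybody figure out a way to programmatically read the numbers in one column of a .PDF file? In other words, is it possible to drop in a PDF file and build something that takes it in and reads out the entire column?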

The values in the column are formatted like this:

40123211155713

The following code uses iTextSharp to open any PDF and read its text into a string:

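// Requires: using System.IO; using System.Text;
// using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public static string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not accumulate across pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

                // The extracted text is already a .NET string, so no encoding conversion is needed.
                // AppendLine keeps digits at a page boundary from running together.
                text.AppendLine(currentText);
            }
        }
        finally
        {
            pdfReader.Close(); // release the file even if extraction throws
        }
    }
    return text.ToString(); // empty string if the file does not exist
}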

You will need to use a PDF-processing library. There is an existing SO discussion on this topic:

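Take a look at this

@Jared's post is a great start, but keep in mind that PDFs do not store tables, only things that happen to look like tables.

@ChrisHaas I realize that, but since I only need one column, Jared's answer works perfectly! Thanks.

This uses SimpleTextExtractionStrategy. Depending on the use case, you may need a different text extraction strategy, for example LocationTextExtractionStrategy.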
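Since a PDF only stores positioned text rather than real tables, one way to isolate a single column is to restrict extraction to the rectangle that column occupies on the page. The following is a minimal sketch, assuming iTextSharp 5.x; the class name `ColumnReader`, the method `ReadRegion`, and the idea that the column's coordinates are known in advance are all assumptions made for illustration and do not come from the thread.

```csharp
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnReader
{
    // Hypothetical helper: extracts only the text that falls inside `columnArea` on one page.
    // The rectangle is given in PDF points; the origin is typically the bottom-left corner.
    public static string ReadRegion(string fileName, int page, Rectangle columnArea)
    {
        PdfReader reader = new PdfReader(fileName);
        try
        {
            // Discard any text chunk that lies outside the column's rectangle.
            RenderFilter[] filters = { new RegionTextRenderFilter(columnArea) };

            // LocationTextExtractionStrategy orders the remaining chunks by position on the page.
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), filters);

            return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call might look like `ColumnReader.ReadRegion(@"path\to\pdf\file.pdf", 1, new Rectangle(400, 50, 550, 750))`, where the coordinates are placeholders that would have to be measured from the actual document.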

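// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string text = ReadPdfFile(@"path\to\pdf\file.pdf");
// Matches runs of 15 digits; change the count to match the column's format
// (the sample value shown above has 14 digits).
Regex regex = new Regex(@"(?<number>\d{15})");
List<string> results = new List<string>();
foreach (Match m in regex.Matches(text))
{
    results.Add(m.Groups["number"].Value);
}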
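A plain `\d{N}` pattern will also match the first N digits of any longer run, which could pull in fragments of unrelated numbers from elsewhere in the PDF text. Here is a minimal sketch of a stricter variant, assuming the column values are standalone digit runs of a known length; the class `ColumnNumbers` and its `Extract` method are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ColumnNumbers
{
    // Hypothetical helper: returns every standalone run of exactly `digitCount` digits.
    public static List<string> Extract(string text, int digitCount)
    {
        // The lookbehind and lookahead reject a digit on either side,
        // so a longer run of digits is skipped rather than cut into pieces.
        Regex regex = new Regex(@"(?<!\d)\d{" + digitCount + @"}(?!\d)");

        List<string> results = new List<string>();
        foreach (Match m in regex.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

It could be combined with the reader above as `ColumnNumbers.Extract(ReadPdfFile(@"path\to\pdf\file.pdf"), 14)`, with the digit count chosen to match the real column.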