iText C#: read a PDF for regex matches and extract only those pages into a new PDF


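I have run into a problem reading an existing PDF for regex matches and then extracting the matching pages into a new PDF. I was having trouble with the task as a whole.

I decided to clear my head and start from scratch. With the following code I am able to take a 3-page PDF and copy its pages one by one into a new file: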

static void Main(string[] args)
    {
        string srcFile = @"C:\Users\steve\Desktop\original.pdf";
        string dstFile = @"C:\Users\steve\Desktop\result.pdf";
        PdfReader reader = new PdfReader(srcFile);
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
        document.Open();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
            copy.AddPage(importedPage);
        }
        document.Close();
    }

As @Paulo already suggested in a comment:

Before entering the loop, you have to select the pages, using a regex or whatever other means. In the loop, only those pages are then added.

In code, this might look like this:
string srcFile = @"C:\Users\steve\Desktop\original.pdf";
string dstFile = @"C:\Users\steve\Desktop\result.pdf";

PdfReader reader = new PdfReader(srcFile);
ICollection<int> pagesToKeep = new List<int>();

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // Use the text extraction strategy of your choice here...
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

    // Use the content text test of your choice here...
    if (currentText.IndexOf("special") >= 0)
    {
        pagesToKeep.Add(page);
    }
}

// Copy selected pages using PdfCopy
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
document.Open();
foreach (int page in pagesToKeep)
{
    PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
    copy.AddPage(importedPage);
}
document.Close();
reader.Close();
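
Since the question asks for a regex match rather than a plain substring test, the page-selection step can be sketched separately from the PDF handling. The following is a minimal sketch, not the answer's original code: `PageSelector`, `SelectMatchingPages`, the sample texts, and the pattern are hypothetical names and values chosen for illustration. It assumes the text of each page has already been extracted (for example with `PdfTextExtractor.GetTextFromPage` as in the loop above) and collected in page order.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PageSelector
{
    // Returns the 1-based page numbers whose text matches the pattern.
    // pageTexts[0] holds the text of page 1, pageTexts[1] of page 2, and so on.
    public static List<int> SelectMatchingPages(IList<string> pageTexts, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        var pagesToKeep = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Treat a missing page text as empty so IsMatch never receives null.
            if (regex.IsMatch(pageTexts[i] ?? string.Empty))
            {
                pagesToKeep.Add(i + 1);
            }
        }
        return pagesToKeep;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for text extracted from a 3-page PDF (assumed sample data).
        var pageTexts = new List<string>
        {
            "Invoice No. 12345 for customer A",
            "General terms and conditions",
            "Invoice No. 67890 for customer B"
        };

        List<int> pages = PageSelector.SelectMatchingPages(pageTexts, @"Invoice No\. \d{5}");
        Console.WriteLine(string.Join(",", pages)); // prints 1,3
    }
}
```

The resulting list can take the place of `pagesToKeep` in the copy loop. Keeping the matching logic free of PDF types makes it easy to test the pattern on sample strings before running it against real documents.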

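Alternatively, you can select the pages on the reader and write the result with PdfStamper:

// Copy selected pages using PdfStamper
reader.SelectPages(pagesToKeep);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();

The latter variant keeps not only the pages in question but also document-level material, e.g. global JavaScript, document-level file attachments, etc. Whether or not you want that depends on your use case.

Thanks for the reply, mkl. I answered my other post but forgot about this one. I was able to use the test case provided by Chr in my other (similar) post.

With a few small adjustments I was able to get the solution below to work for my project.

Did you know that you can use another PdfReader instance to select the pages to copy?

I do, but it is specifically PdfCopy that gave me the problem; maybe I have not fully understood you. I will check my code and post som... that gets a regex match, so that I am not just asking a question without posting near-complete code.

Before entering the loop, you have to select the pages, using a regex or whatever other means. In the loop, only those pages are then added. I don't see why a PdfCopy instance would have to be created inside any loop.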