How to split a PDF file by result in Java with PDFBox


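I have a PDF file with 60 pages. Every page carries an invoice number; some numbers appear on a single page and others repeat across several pages. I am using Apache PDFBox.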

import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
import java.util.regex.*;

public class PDFtest1 {
    public static void main(String[] args) {
        PDDocument pd;
        try {
            File input = new File("G:\\Sales.pdf");

            // StringBuilder to store the extracted text
            StringBuilder sb = new StringBuilder();
            pd = PDDocument.load(input);
            PDFTextStripper stripper = new PDFTextStripper();

            // Add text to the StringBuilder from the PDF
            sb.append(stripper.getText(pd));

            // "Invoice No." followed by one whitespace character, one word character and ten digits
            Pattern p = Pattern.compile("Invoice No\\.\\s\\w\\d{10}");

            // The Matcher scans the extracted text for the pattern
            Matcher m = p.matcher(sb);

            while (m.find()) {
                // group() returns the text of the current match
                System.out.println(m.group());
            }

            if (pd != null) {
                pd.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I am able to read all the invoice numbers using a Java regular expression. The result looks like this:

run:
Invoice No. D0000003010
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003013
Invoice No. D0000003013
Invoice No. D0000003014
Invoice No. D0000003014
Invoice No. D0000003015
Invoice No. D0000003016

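I need to split the PDF according to invoice number. For example, all pages with invoice number D0000003011 should be merged into a single PDF, and so on. How can I achieve this?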

import java.io.File;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripper;

// Written against the PDFBox 1.8.x API; the class name is only a placeholder.
public class SplitByInvoiceNo
{
    // Capture only the number itself (no leading whitespace) so it can serve as a file name
    private static final Pattern INVOICE_NO = Pattern.compile("Invoice No\\.\\s(\\w\\d{10})");

    public static void main(String[] args) throws IOException, COSVisitorException
    {
        File input = new File("G:\\Sales.pdf");

        PDDocument outputDocument = null;
        PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
        PDFTextStripper stripper = new PDFTextStripper();
        String currentNo = null;
        for (int page = 1; page <= inputDocument.getNumberOfPages(); ++page)
        {
            // extract the text of this page only
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            String text = stripper.getText(inputDocument);

            Matcher m = INVOICE_NO.matcher(text);
            String no = null;
            if (m.find())
            {
                no = m.group(1);
            }
            System.out.println("page: " + page + ", value: " + no);

            PDPage pdPage = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);

            if (no != null && !no.equals(currentNo))
            {
                // the invoice number changed: save what was collected so far
                saveCloseCurrent(currentNo, outputDocument);
                // create new document
                outputDocument = new PDDocument();
                currentNo = no;
            }
            if (no == null && currentNo == null)
            {
                System.out.println("header page ??? " + page + " skipped");
                continue;
            }
            // append page to current document; a page without a number stays with the previous invoice
            outputDocument.importPage(pdPage);
        }
        saveCloseCurrent(currentNo, outputDocument);
        inputDocument.close();
    }

    private static void saveCloseCurrent(String currentNo, PDDocument outputDocument)
            throws IOException, COSVisitorException
    {
        // save to new output file
        if (currentNo != null)
        {
            // save document into file
            File f = new File(currentNo + ".pdf");
            if (f.exists())
            {
                System.err.println("File " + f + " exists?!");
                System.exit(-1);
            }
            outputDocument.save(f);
            outputDocument.close();
        }
    }
}
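
The page-to-file assignment used above can be looked at separately from PDFBox. The following is a minimal sketch, not part of the original answer: it takes the invoice number found on each page (or null when none was found) and returns the page numbers that would go into each output file. The class and method names are made up for illustration. Unlike the code above, which stops when an output file already exists, this sketch merges pages if the same number shows up again later in the document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: models only the grouping decision.
public class PageGrouping {

    /**
     * numbers.get(i) is the invoice number found on page i + 1,
     * or null if no number was found on that page.
     */
    static Map<String, List<Integer>> groupPages(List<String> numbers) {
        // LinkedHashMap keeps the invoices in order of first appearance
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < numbers.size(); i++) {
            String no = numbers.get(i);
            if (no != null) {
                current = no; // this page names its invoice explicitly
            }
            if (current == null) {
                continue; // leading pages without any number are skipped
            }
            // a page without a number stays with the most recent invoice
            groups.computeIfAbsent(current, k -> new ArrayList<>()).add(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample data standing in for the per-page regex results
        List<String> numbers = Arrays.asList(
                null, "D0000003010", "D0000003011", null, "D0000003011", "D0000003012");
        System.out.println(groupPages(numbers));
        // prints {D0000003010=[2], D0000003011=[3, 4, 5], D0000003012=[6]}
    }
}
```

With such a map in hand, each entry would correspond to one output document, and its page list to the pages passed to importPage.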

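Are identical invoice numbers always together? What PDFBox version are you using?

Yes, on some pages the invoice number is unique, and on some pages the same value repeats, i.e. duplicate copies. I use the PDFBox 0.7.3.jar file and also PDFBox-app-1.8.10.jar.

@KiranP To skip blank pages, just replace the line that reports "found nothing on page" with "continue;", i.e. code like this: if (!m.find()) { continue; } Or is the requirement that some pages have no invoice # but must still be appended?

@KiranP The code has been improved, hope it helps. Please learn to be very precise, i.e. don't write "the second PDF is empty" when you really mean "the second page of the source PDF has no invoice #".

@KiranP I saw there was a bug, so I improved the code again. Please check whether it works now.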
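
The "skip pages without an invoice number" variant discussed in the comments can be tried without a PDF at hand. This is a rough sketch under assumptions: the page texts are hard-coded stand-ins for what PDFTextStripper would return for each page, the class and method names are invented, and the pattern allows one or more whitespace characters after "Invoice No." (slightly looser than the pattern in the question).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; only java.util.regex is used.
public class InvoiceNoExtractor {

    // Group 1 captures the number only: one word character followed by ten digits
    private static final Pattern INVOICE_NO =
            Pattern.compile("Invoice No\\.\\s+(\\w\\d{10})");

    /** Returns the first invoice number in the page text, or null if there is none. */
    static String extractInvoiceNo(String pageText) {
        Matcher m = INVOICE_NO.matcher(pageText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up page texts; the second one represents a blank page
        String[] pages = {
            "Invoice No. D0000003010\nTotal: 10.00",
            "",
            "Invoice No. D0000003011\nTotal: 20.00"
        };
        for (int i = 0; i < pages.length; i++) {
            String no = extractInvoiceNo(pages[i]);
            if (no == null) {
                continue; // blank page or page without a number: skip it
            }
            System.out.println("page " + (i + 1) + ": " + no);
        }
        // prints:
        // page 1: D0000003010
        // page 3: D0000003011
    }
}
```

If pages without a number must instead be appended to the preceding invoice, the continue would be dropped in favour of reusing the last number seen.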