How to remove headers and footers from a PDF file using PDFBox in Java


I am using a PDF parser to convert a PDF to text. Below is the Java code I use to convert the PDF to a text file. My PDF file contains the following data:

    Data Sheet(Header)
    PHP Courses for PHP Professionals(Header)

   Networking Academy
    We live in an increasingly connected world, creating a global economy and a growing need for technical skills.  Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.

    All copyrights reserved.(Footer).

Sample code:

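// Imports assume PDFBox 1.8.x, whose parser API this code uses
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDF_TEST {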
    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // PDFTextParser Constructor 
    public PDF_TEST() {
    }

    // Extract text from PDF Document
    String pdftoText(String fileName) {


        File f = new File(fileName);

        if (!f.isFile()) {

            return null;
        }

        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {

            return null;
        }

        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        } finally {
            // Release the document on both the success and the failure path
            try {
                if (pdDoc != null) pdDoc.close();
                else if (cosDoc != null) cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }

        return parsedText;
    }

    // Write the parsed text from PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {


        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();     
        } catch (Exception e) {

            e.printStackTrace();
        }

    }

    //Extracts text from a PDF Document and writes it to a text file
    public static void test() {
        String args[]={"C://Sample.pdf","C://Sample.txt"};
        if (args.length != 2) {

            System.exit(1);
        }

        // The class is named PDF_TEST, so instantiate that type
        PDF_TEST pdfTextParserObj = new PDF_TEST();

        String pdfToText = pdfTextParserObj.pdftoText(args[0]);

        // Only write the output file when extraction succeeded
        if (pdfToText != null) {
            pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
        }
    }  

    public static void main(String args[]) throws IOException
    {
        test();
    }
}

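The code above extracts the PDF as text, but my requirement is to ignore the header and footer and extract only the content from the PDF file. Required output: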

Networking Academy
        We live in an increasingly connected world, creating a global economy and a growing need for technical skills.  Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.

Please tell me how to do this.

Thanks.

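In general, there is nothing special about header or footer text in a PDF. That material can be tagged differently, but tagging is optional, and the OP did not provide a sample PDF to check.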

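So some manual work (or fairly error-prone image analysis) is usually needed to find the regions of the page that hold the header, the content, and the footer.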
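When measuring regions by hand is not practical, one cheap heuristic (not something this answer proposes, and not part of PDFBox) is to extract each page's text separately and drop the leading and trailing lines that are identical on every page. The sketch below assumes you already have one string per page; the class name RepeatedLineFilter and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: strips leading/trailing lines that repeat on every page. */
final class RepeatedLineFilter {

    /** Splits one page's text into its non-blank, trimmed lines. */
    static List<String> lines(String pageText) {
        List<String> result = new ArrayList<>();
        for (String line : pageText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                result.add(line.trim());
            }
        }
        return result;
    }

    /** True if every page is non-empty and all pages share the same first (or last) line. */
    private static boolean sharedOnAllPages(List<List<String>> pages, boolean first) {
        String expected = null;
        for (List<String> page : pages) {
            if (page.isEmpty()) {
                return false;
            }
            String candidate = first ? page.get(0) : page.get(page.size() - 1);
            if (expected == null) {
                expected = candidate;
            } else if (!expected.equals(candidate)) {
                return false;
            }
        }
        return expected != null;
    }

    /** Returns the page texts with shared header and footer lines removed. */
    static List<String> stripRepeated(List<String> pageTexts) {
        // With fewer than two pages there is nothing to compare against
        if (pageTexts.size() < 2) {
            return new ArrayList<>(pageTexts);
        }
        List<List<String>> pages = new ArrayList<>();
        for (String text : pageTexts) {
            pages.add(lines(text));
        }
        // Peel off shared first lines, then shared last lines, one at a time
        while (sharedOnAllPages(pages, true)) {
            for (List<String> page : pages) {
                page.remove(0);
            }
        }
        while (sharedOnAllPages(pages, false)) {
            for (List<String> page : pages) {
                page.remove(page.size() - 1);
            }
        }
        List<String> out = new ArrayList<>();
        for (List<String> page : pages) {
            out.add(String.join("\n", page));
        }
        return out;
    }
}
```

This only works when the header and footer text is literally the same on every page; a footer containing a page number would not match, and a single-page document is returned unchanged.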

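However, once you have the coordinates of those regions, you can use PDFTextStripperByArea, which extends PDFTextStripper, to collect text by region. Simply define a region for the page content, using the largest rectangle that includes the content but excludes the headers and footers, and after the pdfStripper.getText(pdDoc) call, call getTextForRegion for the defined region.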
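As a rough illustration of "the largest rectangle that includes the content but excludes the headers and footers", the sketch below computes such a rectangle from a page size and measured header and footer heights. It uses only java.awt.geom and assumes that the region's y coordinate runs downward from the top of the page, which is how the rectangle is used elsewhere in this thread; check this against your PDFBox version. The class name ContentRegion and the method between are hypothetical.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: builds the content rectangle between a header band and a footer band. */
final class ContentRegion {

    /**
     * All values are in PDF user-space units (points).
     * Assumes y = 0 is the top edge of the page and y grows downward.
     */
    static Rectangle2D between(double pageWidth, double pageHeight,
                               double headerHeight, double footerHeight) {
        double contentHeight = pageHeight - headerHeight - footerHeight;
        if (pageWidth <= 0 || headerHeight < 0 || footerHeight < 0 || contentHeight <= 0) {
            throw new IllegalArgumentException("header and footer leave no room for content");
        }
        // Start just below the header band and stop where the footer band begins
        return new Rectangle2D.Double(0, headerHeight, pageWidth, contentHeight);
    }
}
```

For a 612 x 792 point page with 72 point header and footer bands, ContentRegion.between(612, 792, 72, 72) gives a rectangle at (0, 72) that is 612 wide and 648 high, which can then be registered with the stripper as the content region.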

You can use PDFTextStripperByArea to leave the header and footer out when extracting text from a PDF file.
Java code using PDFBox:

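import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

    // Uses the PDFBox 2.x API. pageNumber is treated as 1-based here (an assumption);
    // getPage expects a 0-based index.
    public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
        File file = new File(path + filename);
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(file)) {
            // Rectangle2D.Double(x, y, width, height): adjust to the content area of your pages
            Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
            String regionName = "region";
            PDPage page = document.getPage(pageNumber - 1);
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.addRegion(regionName, region);
            stripper.extractRegions(page);
            return stripper.getTextForRegion(regionName);
        }
    }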