How to remove headers and footers from a PDF file using PDFBox in Java
I am using a PDF parser to convert a PDF to text. Below is the Java code I use to convert the PDF into a text file. My PDF file contains the following data:
Data Sheet(Header)
PHP Courses for PHP Professionals(Header)
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
All copyrights reserved.(Footer).
Sample code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper; // PDFBox 1.x package; 2.x moved it to org.apache.pdfbox.text

public class PDF_TEST {

    PDFParser parser;
    String parsedText;
    PDFTextStripper pdfStripper;
    PDDocument pdDoc;
    COSDocument cosDoc;
    PDDocumentInformation pdDocInfo;

    // Extract text from a PDF document; returns null if the file cannot be read or parsed
    String pdftoText(String fileName) {
        File f = new File(fileName);
        if (!f.isFile()) {
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(f));
        } catch (Exception e) {
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        } finally {
            // Release the document whether or not extraction succeeded
            try {
                if (pdDoc != null) pdDoc.close();
                else if (cosDoc != null) cosDoc.close();
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        return parsedText;
    }

    // Write the parsed text from the PDF to a file
    void writeTexttoFile(String pdfText, String fileName) {
        try {
            PrintWriter pw = new PrintWriter(fileName);
            pw.print(pdfText);
            pw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Extract text from a PDF document and write it to a text file
    public static void test() {
        String args[] = {"C://Sample.pdf", "C://Sample.txt"};
        if (args.length != 2) {
            System.exit(1);
        }
        PDF_TEST pdfTextParserObj = new PDF_TEST();
        String pdfToText = pdfTextParserObj.pdftoText(args[0]);
        if (pdfToText == null) {
            System.err.println("Could not extract text from " + args[0]);
        } else {
            pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
        }
    }

    public static void main(String args[]) throws IOException {
        test();
    }
}
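The code above extracts the whole PDF as text, but my requirement is to ignore the header and footer and extract only the body content from the PDF file.
Desired output: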
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
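Please tell me how to do this.
Thanks.

In general, there is nothing special about header or footer text in a PDF. The material can be tagged differently, but tagging is optional, and the OP did not provide a sample PDF to check.
So some manual work (or somewhat failure-prone image analysis) is usually needed to find the regions of the page that hold the header, the content, and the footer.
Once you have the coordinates of those regions, though, you can use PDFTextStripperByArea, which extends PDFTextStripper, to collect text by region. Define a region for the page content using the largest rectangle that includes the content but excludes the header and footer. Then, instead of relying on pdfStripper.getText(pdDoc) alone, process the page with extractRegions and call getTextForRegion for the region you defined.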
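The "largest rectangle that includes the content but excludes the header and footer" can be computed from the page size and the two band heights. Below is a minimal sketch of that arithmetic. ContentRegion and its method are hypothetical names, not part of PDFBox, and the sketch assumes that region coordinates are measured in PDF points from the top-left corner of the page with y growing downward; the band heights themselves still have to be measured on your own document.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not a PDFBox class): computes the body region of a page. */
final class ContentRegion {
    private ContentRegion() {}

    /**
     * Returns the full-width rectangle that lies below a header band of
     * headerHeight points and above a footer band of footerHeight points.
     * Assumption: top-left origin, y increasing downward.
     */
    static Rectangle2D.Double between(double pageWidth, double pageHeight,
                                      double headerHeight, double footerHeight) {
        double bodyHeight = pageHeight - headerHeight - footerHeight;
        if (pageWidth <= 0 || headerHeight < 0 || footerHeight < 0 || bodyHeight <= 0) {
            throw new IllegalArgumentException("header and footer leave no room for content");
        }
        // The region starts right under the header and stops right above the footer
        return new Rectangle2D.Double(0, headerHeight, pageWidth, bodyHeight);
    }
}
```

For a 612 x 792 point page with a 100 point header and a 60 point footer, ContentRegion.between(612, 792, 100, 60) yields x = 0, y = 100, width = 612, height = 632. The resulting rectangle is what you would pass to addRegion.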
You can use PDFTextStripperByArea to leave out the header and footer when extracting text from a PDF file.
Java code using PDFBox:
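import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// PDFBox 2.x API. pageNumber is the zero-based page index expected by PDDocument.getPage.
public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
    File file = new File(path + filename);
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(file)) {
        // Rectangle2D region = new Rectangle2D.Double(x, y, width, height);
        // Example values only: choose a rectangle that covers the body but not the header or footer
        Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
        String regionName = "region";
        PDPage page = document.getPage(pageNumber);
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.addRegion(regionName, region);
        stripper.extractRegions(page);
        return stripper.getTextForRegion(regionName);
    }
}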
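If the header and footer coordinates are not known, one fallback is to work on the extracted text instead of on regions. This is a different technique from the region-based one above and is not taken from either answer: it is a rough sketch that assumes you already have one string per page (for example from PDFTextStripper with setStartPage and setEndPage) and that the header and footer lines are identical on every page. RepeatedLineFilter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: removes lines that repeat at the top or bottom of every page. */
final class RepeatedLineFilter {
    private RepeatedLineFilter() {}

    /** Drops leading lines shared by all pages, then trailing lines shared by all pages. */
    static List<String> stripRepeatedEdges(List<String> pageTexts) {
        if (pageTexts.size() < 2) {
            return new ArrayList<>(pageTexts); // nothing to compare against
        }
        List<List<String>> pages = new ArrayList<>();
        for (String text : pageTexts) {
            pages.add(new ArrayList<>(Arrays.asList(text.split("\\R"))));
        }
        while (sameEdge(pages, true)) {
            for (List<String> p : pages) p.remove(0);
        }
        while (sameEdge(pages, false)) {
            for (List<String> p : pages) p.remove(p.size() - 1);
        }
        List<String> out = new ArrayList<>();
        for (List<String> p : pages) out.add(String.join("\n", p));
        return out;
    }

    /** True if every page is non-empty and all pages share the same first (or last) line. */
    private static boolean sameEdge(List<List<String>> pages, boolean first) {
        String expected = null;
        for (List<String> p : pages) {
            if (p.isEmpty()) return false;
            String line = (first ? p.get(0) : p.get(p.size() - 1)).trim();
            if (expected == null) expected = line;
            else if (!expected.equals(line)) return false;
        }
        return true;
    }
}
```

The sketch does not handle footers that change per page (such as page numbers), and it needs at least two pages to compare, so treat it as a starting point rather than a replacement for region-based extraction.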