Java: How do I extract images from a PDF in the correct order using iText?
I am trying to extract images from a PDF file. I found a nice example online:
PdfReader reader;
File file = new File("example.pdf");
reader = new PdfReader(file.getAbsolutePath());
for (int i = 0; i < reader.getXrefSize(); i++) {
    PdfObject pdfobj = reader.getPdfObject(i);
    if (pdfobj == null || !pdfobj.isStream()) {
        continue;
    }
    PdfStream stream = (PdfStream) pdfobj;
    PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
    if (pdfsubtype != null && pdfsubtype.toString().equals(PdfName.IMAGE.toString())) {
        byte[] img = PdfReader.getStreamBytesRaw((PRStream) stream);
        FileOutputStream out = new FileOutputStream(new File(file.getParentFile(), String.format("%1$05d", i) + ".jpg"));
        out.write(img);
        out.flush();
        out.close();
    }
}
This gives me all the images, but in the wrong order. My next attempt looked like this:
// iText page numbers are 1-based, so start at 1
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    PdfDictionary d = reader.getPageN(i);
    PdfIndirectReference ir = d.getAsIndirectObject(PdfName.CONTENTS);
    PdfObject o = reader.getPdfObject(ir.getNumber());
    PdfStream stream = (PdfStream) o;
    // rest from example above
}
I found the answer elsewhere, namely on the iText mailing list.
The following code works for me. Note that I switched to PDFBox:
PDDocument document = null;
document = PDDocument.load(inFile);
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while (iter.hasNext()) {
    PDPage page = (PDPage) iter.next();
    PDResources resources = page.getResources();
    Map pageImages = resources.getImages();
    if (pageImages != null) {
        Iterator imageIter = pageImages.keySet().iterator();
        while (imageIter.hasNext()) {
            String key = (String) imageIter.next();
            PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
            image.write2OutputStream(/* some output stream */);
        }
    }
}
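A caveat on ordering, offered as a rough sketch and not as part of the mailing-list answer: a page's resource dictionary only maps names to image objects, while the order in which images are actually painted is given by the `Do` operators in the page's content stream. Assuming you already have the decoded content stream as text, and assuming the keys of the image map are the XObject resource names, something like the following could recover the paint order. The class and method names are hypothetical, and the regex is deliberately naive: it ignores inline images and nested form XObjects, and could be fooled by a matching token inside a string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists XObject names in the order they are painted.
class PaintOrder {
    // Matches "/Name Do", the content-stream operator that paints a named XObject.
    private static final Pattern DO_OPERATOR =
            Pattern.compile("/([^\\s/()<>\\[\\]{}%]+)\\s+Do(?=\\s|$)");

    // decodedContent is assumed to be an already-decompressed content stream.
    static List<String> paintedXObjectNames(String decodedContent) {
        List<String> names = new ArrayList<>();
        Matcher m = DO_OPERATOR.matcher(decodedContent);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "q 100 0 0 100 0 0 cm /Im2 Do Q\n"
                       + "q 50 0 0 50 10 10 cm /Im1 Do Q";
        System.out.println(paintedXObjectNames(content)); // prints [Im2, Im1]
    }
}
```

The returned names can then be used as lookup keys in place of iterating over `keySet()`, so images are written in paint order. A name may also refer to a form XObject instead of an image, so a lookup that returns nothing should simply be skipped.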
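Is PDXObjectImage also part of iText? I can't seem to find it.
@FilipeCorreia nratx forgot to mention that he switched to Apache PDFBox.
For some PDF files, the line PDResources resources = page.getResources(); needs to be replaced with PDResources resources = page.findResources();
This code still extracts the images in the wrong order (tested with PDFBox 1.6 and 1.8).
This worked for me using iText; it preserved the order and extracted both jpg and png images.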
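On the jpg/png point: the first snippet in the question writes the raw stream bytes under a `.jpg` name, but raw bytes only form a valid JPEG file when the stream really holds JPEG data. A small sanity check, sketched below with hypothetical names, is to look at the leading signature bytes before choosing an extension. This only recognizes the standard JPEG and PNG signatures; anything else (for example raw Flate-compressed samples) falls back to `bin` and would still need decoding by the PDF library.

```java
// Hypothetical helper: picks a file extension from the leading signature bytes.
class ImageTypeSniffer {
    static String extensionFor(byte[] data) {
        // JPEG files start with FF D8 FF
        if (startsWith(data, 0xFF, 0xD8, 0xFF)) {
            return "jpg";
        }
        // PNG files start with the 8-byte signature 89 'P' 'N' 'G' 0D 0A 1A 0A
        if (startsWith(data, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "png";
        }
        return "bin"; // unknown: not directly usable as an image file
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(extensionFor(jpegStart)); // prints jpg
    }
}
```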