Java: Get the BBox of a tag in PDFBox even when there are no layout-related attributes (/A in the document catalog structure)?

Tags: java, pdf, accessibility, pdfbox, tagged-pdf

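I want to highlight the bbox of a specific tag when the user selects that tag in the structure tree. When the tag carries attributes like this, I am able to get the bbox from them.

But I found that in some PDFs, even though there are no attributes such as /A, Adobe DC can still highlight the content bbox when you select a specific tag. How can I get the bbox in this case? The code I tried for reading the bbox-related attributes is: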

String inputPdfFile = "D:/Documents/pdfs/res.pdf";
PDDocument old_document = PDDocument.load(new File(inputPdfFile));
PDStructureTreeRoot treeRoot = old_document.getDocumentCatalog().getStructureTreeRoot();
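for (Object kid : treeRoot.getKids()) {
    // Kids can also be MCIDs or references, so check the type before casting
    if (!(kid instanceof PDStructureElement)) continue;
    for (Object kid2 : ((PDStructureElement) kid).getKids()) {
        if (!(kid2 instanceof PDStructureElement)) continue;
        for (Object kid3 : ((PDStructureElement) kid2).getKids()) {
            if (kid3 instanceof PDStructureElement) {
                // Prints the attribute objects (/A) of the element, if any
                System.out.println(((PDStructureElement) kid3).getAttributes());
            }
        }
    }
}
old_document.close();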

The PDF link is


Please help me find any way to do this…

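To determine the actual bounding box of the text of some marked content, as opposed to the box given in the layout attributes of a structure element, you can use the PDFBox PDFMarkedContentExtractor and combine its results with the data of the PDF structure tree.

The following code does this and creates an output PDF in which the determined bounding boxes are framed by colored rectangles: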

PDDocument document = PDDocument.load(SOURCE);

Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();

for (PDPage page : document.getPages()) {
    PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
    extractor.processPage(page);

    Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
    markedContents.put(page, theseMarkedContents);
    for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
        addToMap(theseMarkedContents, markedContent);
    }
}

PDStructureNode root = document.getDocumentCatalog().getStructureTreeRoot();
Map<PDPage, PDPageContentStream> visualizations = new HashMap<>();
showStructure(document, root, markedContents, visualizations);
for (PDPageContentStream canvas : visualizations.values())
    canvas.close();

document.save(RESULT);
document.close();
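
The code above calls addToMap, whose body is not shown in this text. A minimal sketch of what such a helper might do, assuming it indexes marked-content sections by their MCID and descends into nested sections. To keep it runnable without PDFBox, the class MarkedSection is a hypothetical stand-in for PDMarkedContent, and McidIndex is an illustrative name, not part of the answer:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for PDMarkedContent: an optional MCID plus nested sections.
class MarkedSection {
    final Integer mcid;                                   // null when the section has no MCID
    final List<MarkedSection> children = new ArrayList<>();

    MarkedSection(Integer mcid) {
        this.mcid = mcid;
    }

    MarkedSection with(MarkedSection child) {
        children.add(child);
        return this;
    }
}

class McidIndex {
    // Registers the section under its MCID (if it has one) and recurses into nested sections.
    static void addToMap(Map<Integer, MarkedSection> byMcid, MarkedSection section) {
        if (section.mcid != null) {
            byMcid.put(section.mcid, section);
        }
        for (MarkedSection child : section.children) {
            addToMap(byMcid, child);
        }
    }

    public static void main(String[] args) {
        // An untagged wrapper containing MCID 0 and MCID 1, the latter nesting MCID 2.
        MarkedSection wrapper = new MarkedSection(null)
                .with(new MarkedSection(0))
                .with(new MarkedSection(1).with(new MarkedSection(2)));
        Map<Integer, MarkedSection> byMcid = new HashMap<>();
        addToMap(byMcid, wrapper);
        System.out.println(byMcid.keySet());
    }
}
```

MCIDs are only unique within one content stream, which is presumably why the main code keeps a separate map per page.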
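
Helper methods

The showStructure method recursively determines the bounding boxes of the structure elements and draws a rectangle for each of them. A structure element can contain content on more than one page, so the method has to work with a map from page to bounding box, held in its boxes variable.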

Map<PDPage, Rectangle2D> showStructure(PDDocument document, PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents, Map<PDPage, PDPageContentStream> visualizations) throws IOException {
    Map<PDPage, Rectangle2D> boxes = null;
    PDPage page = null;
    if (node instanceof PDStructureElement) {
        PDStructureElement element = (PDStructureElement) node;
        page = element.getPage();
    }
    Map<Integer, PDMarkedContent> theseMarkedContents = markedContents.get(page);
    for (Object object : node.getKids()) {
        if (object instanceof COSArray) {
            for (COSBase base : (COSArray) object) {
                if (base instanceof COSDictionary) {
                    boxes = union(boxes, showStructure(document, PDStructureNode.create((COSDictionary) base), markedContents, visualizations));
                } else if (base instanceof COSNumber) {
                    boxes = union(boxes, page, showContent(((COSNumber)base).intValue(), theseMarkedContents));
                } else {
                    System.out.printf("?%s\n", base);
                }
            }
        } else if (object instanceof PDStructureNode) {
            boxes = union(boxes, showStructure(document, (PDStructureNode) object, markedContents, visualizations));
        } else if (object instanceof Integer) {
            boxes = union(boxes, page, showContent((Integer)object, theseMarkedContents));
        } else {
            System.out.printf("?%s\n", object);
        }

    }
    if (boxes != null) {
        Color color = new Color((int)(Math.random() * 256), (int)(Math.random() * 256), (int)(Math.random() * 256));

        for (Map.Entry<PDPage, Rectangle2D> entry : boxes.entrySet()) {
            page = entry.getKey();
            Rectangle2D box = entry.getValue();
            if (box == null)
                continue;

            PDPageContentStream canvas = visualizations.get(page);
            if (canvas == null) {
                canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true);
                visualizations.put(page, canvas);
            }
            canvas.saveGraphicsState();
            canvas.setStrokingColor(color);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
            canvas.restoreGraphicsState();
        }
    }
    return boxes;
}

The two methods showStructure and showContent use the following helpers to build the per-page unions of bounding boxes:

Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D>... maps) {
    Map<PDPage, Rectangle2D> result = null;
    for (Map<PDPage, Rectangle2D> map : maps) {
        if (map != null) {
            if (result != null) {
                for (Map.Entry<PDPage, Rectangle2D> entry : map.entrySet()) {
                    PDPage page = entry.getKey();
                    Rectangle2D rectangle = union(result.get(page), entry.getValue());
                    if (rectangle != null)
                        result.put(page, rectangle);
                }
            } else {
                result = map;
            }
        }
    }
    return result;
}

Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D> map, PDPage page, Rectangle2D rectangle) {
    if (map == null)
        map = new HashMap<>();
    map.put(page, union(map.get(page), rectangle));
    return map;
}

Rectangle2D union(Rectangle2D... rectangles)
{
    Rectangle2D box = null;
    for (Rectangle2D rectangle : rectangles) {
        if (rectangle != null) {
            if (box != null)
                box.add(rectangle);
            else
                box = rectangle;
        }
    }
    return box;
}
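
One detail of the helpers above: they reuse the first non-null rectangle (or map) as the accumulator, so Rectangle2D.add enlarges an object the caller may still hold. Where that matters, a copying variant can be sketched as follows. This is an illustration only; the class name BoxUnion is made up, and the type parameter K stands in for PDPage so that the sketch runs with the JDK alone:

```java
import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.Map;

class BoxUnion {
    // Returns a new rectangle covering all non-null inputs, or null if there are none.
    static Rectangle2D union(Rectangle2D... rectangles) {
        Rectangle2D box = null;
        for (Rectangle2D rectangle : rectangles) {
            if (rectangle == null) {
                continue;
            }
            if (box == null) {
                box = (Rectangle2D) rectangle.clone();   // copy, so the input stays untouched
            } else {
                box.add(rectangle);
            }
        }
        return box;
    }

    // Merges per-page boxes into a fresh map; null maps and null boxes are skipped.
    @SafeVarargs
    static <K> Map<K, Rectangle2D> union(Map<K, Rectangle2D>... maps) {
        Map<K, Rectangle2D> result = new HashMap<>();
        for (Map<K, Rectangle2D> map : maps) {
            if (map == null) {
                continue;
            }
            for (Map.Entry<K, Rectangle2D> entry : map.entrySet()) {
                if (entry.getValue() != null) {
                    result.merge(entry.getKey(), union(entry.getValue()), (a, b) -> union(a, b));
                }
            }
        }
        return result;
    }
}
```

Unlike the original helper, this map variant returns an empty map instead of null, so a caller would test isEmpty() rather than compare against null.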
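
The calculateGlyphBounds method computes the bounds of a single glyph from the text rendering matrix and the font:

private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException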
{
    GeneralPath path = null;
    AffineTransform at = textRenderingMatrix.createAffineTransform();
    at.concatenate(font.getFontMatrix().createAffineTransform());
    if (font instanceof PDType3Font)
    {
        // It is difficult to calculate the real individual glyph bounds for type 3 fonts
        // because these are not vector fonts, the content stream could contain almost anything
        // that is found in page content streams.
        PDType3Font t3Font = (PDType3Font) font;
        PDType3CharProc charProc = t3Font.getCharProc(code);
        if (charProc != null)
        {
            BoundingBox fontBBox = t3Font.getBoundingBox();
            PDRectangle glyphBBox = charProc.getGlyphBBox();
            if (glyphBBox != null)
            {
                // PDFBOX-3850: glyph bbox could be larger than the font bbox
                glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                path = glyphBBox.toGeneralPath();
            }
        }
    }
    else if (font instanceof PDVectorFont)
    {
        PDVectorFont vectorFont = (PDVectorFont) font;
        path = vectorFont.getPath(code);

        if (font instanceof PDTrueTypeFont)
        {
            PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
            int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
            at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        }
        if (font instanceof PDType0Font)
        {
            PDType0Font t0font = (PDType0Font) font;
            if (t0font.getDescendantFont() instanceof PDCIDFontType2)
            {
                int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
        }
    }
    else if (font instanceof PDSimpleFont)
    {
        PDSimpleFont simpleFont = (PDSimpleFont) font;

        // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
        // which is why PDVectorFont is tried first.
        String name = simpleFont.getEncoding().getName(code);
        path = simpleFont.getPath(name);
    }
    else
    {
        // shouldn't happen, please open issue in JIRA
        System.out.println("Unknown font class: " + font.getClass());
    }
    if (path == null)
    {
        return null;
    }
    return at.createTransformedShape(path.getBounds2D());
}
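
The scaling steps in calculateGlyphBounds can be hard to follow, so here is a small worked sketch using only java.awt.geom. It assumes the usual font matrix of 0.001 (1000 glyph-space units per text-space unit) and a TrueType font whose outlines use unitsPerEm design units; the class name GlyphScale and all numbers are illustrative assumptions:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class GlyphScale {
    // Maps a glyph outline box given in font design units to user space.
    static Rectangle2D toUserSpace(Rectangle2D outlineBox, int unitsPerEm,
                                   AffineTransform textRenderingMatrix) {
        AffineTransform at = new AffineTransform(textRenderingMatrix);
        // Font matrix: glyph space (1000 units) -> text space
        at.concatenate(AffineTransform.getScaleInstance(0.001, 0.001));
        // TrueType design units -> 1000-unit glyph space
        at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        return at.createTransformedShape(outlineBox).getBounds2D();
    }

    public static void main(String[] args) {
        // Assumed values: 12 pt text placed at (100, 700), font with 2048 units per em.
        AffineTransform trm = new AffineTransform(12, 0, 0, 12, 100, 700);
        Rectangle2D outline = new Rectangle2D.Double(0, 0, 1024, 1434);
        System.out.println(toUserSpace(outline, 2048, trm));
    }
}
```

With these values, an outline 1024 design units wide is half an em, which at 12 pt comes out as about 6 user-space units.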

The result for the example document looks like this:


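The elements of the structure tree correspond, via marked-content IDs, to specific drawing instructions in the page content or in related content streams. You essentially only have to determine the area in which those instructions draw. Obviously, this only gives you the actual bounding box, not an intended or reserved box…

@mkl Thanks for your reply. In the attached file, Adobe is able to get the tagged area. How can I get the bbox of each tag? Please give me some clues. I will try to work from the drawing instructions of a tag and use them to get the bbox with PDFBox.

I suspect you need to call getCOSObject on these objects. If you get a dictionary, you could try calling getItem(COSName.BBOX).

@TilmanHausherr If I understand the OP correctly, in the documents he has to handle now there are no attribute (/A) objects, and no class names (/C) either. So if layout details are needed, they have to be derived from the actual drawing instructions in the content stream.

@mkl Yes, you are correct. I need to highlight the content based on the positions given by the instructions. I can do what Tilman said, but it does not solve my problem 100%. Thanks, please help me.

Thanks @mkl. Thank you very much. I will work with this code in a while. I really appreciate it.

Hi @mkl. I ran the above code against this file. It does not detect images, and the bounding boxes do not match. This is the code I am using, please check.

"It does not detect images." - Correct. See the introduction of my answer: "To determine the actual bounding box of the text of some marked content..." For content other than text, both the PDFMarkedContentExtractor and the code above have to be extended somewhat.

Let me try the changes needed to detect images. I want to highlight the content of every tag: images, vector graphics, and links. I will get back to you. Thanks @mkl.

@mkl Thanks for your answer. I tried to extract the BBox for the image layout from the attributes. I am able to extract the XObject for normal images, but not for vector images, meaning an image plus graphics plus text that are grouped like an image.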
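
On the point in the comments about content other than text: an image XObject is always painted into the unit square, which the current transformation matrix (CTM) maps onto the page. Once the CTM in effect at the image's Do operator is known, the image's box follows directly. Below is a minimal sketch of that last step with plain java.awt.geom. The class name ImageBounds and the matrix values are assumptions for illustration, and capturing the CTM inside a marked-content sequence with PDFBox is not shown:

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class ImageBounds {
    // Maps the unit square, into which an image XObject is painted, through the CTM.
    static Rectangle2D imageBounds(AffineTransform ctm) {
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }

    public static void main(String[] args) {
        // Assumed CTM, as a content stream fragment like "200 0 0 100 50 400 cm" would set it.
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 400);
        System.out.println(imageBounds(ctm));
    }
}
```

The resulting rectangle could then be merged into the per-page boxes in the same way as the text bounds. Vector graphics need more work, because the bounds of each painted path have to be collected instead.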