Java: How to extract the label text from a push button using Apache PDFBox?

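Suppose I have managed to get hold of a PDTerminalField that is an instance of PDPushButton. Looking at the API provided, though, I cannot figure out how to extract the label of that button.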

No code is included because the application is too verbose. A sample PDF was provided instead.

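(Thanks to @Tilman for correcting me here.) There is indeed such a property, and you can access it via getAppearanceCharacteristics().getNormalCaption(). The property is optional, however, and its content is not guaranteed to match the visual appearance of the button, because the appearance stream may contain different information. A combined strategy of querying the property and reading the appearance stream may therefore be necessary.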
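
One possible shape for such a combined strategy is sketched below. It is only an illustration under stated assumptions: the class name ButtonLabelResolver and the method resolve are hypothetical and not part of PDFBox, the two inputs are assumed to have been obtained already (the text extracted from the normal appearance stream, and the value of the optional normal caption), and preferring the appearance text over the caption is a policy choice made here because the appearance is what a viewer actually displays.

```java
public class ButtonLabelResolver {

    /**
     * Picks a label from two candidate sources (hypothetical helper, not a PDFBox API).
     *
     * @param appearanceText text extracted from the normal appearance stream, may be null
     * @param caption        value of the optional normal caption, may be null
     * @return the trimmed appearance text if it is non-blank, else the trimmed
     *         caption if it is non-blank, else null
     */
    public static String resolve(String appearanceText, String caption) {
        if (appearanceText != null && !appearanceText.trim().isEmpty()) {
            return appearanceText.trim();
        }
        if (caption != null && !caption.trim().isEmpty()) {
            return caption.trim();
        }
        return null;
    }

    public static void main(String[] args) {
        // Appearance text wins when both sources are present.
        System.out.println(resolve("My Button", "Click"));
        // Falls back to the caption when the appearance holds no text.
        System.out.println(resolve("  ", "Click"));
    }
}
```

Returning null when neither source yields text leaves it to the caller to decide how to treat buttons without any label, for example icon-only buttons.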

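The appearance stream of a button in a PDF can contain any number of graphics and text drawing instructions used to paint the button, and such a stream is not necessarily easy to read or parse. For the sample file provided by the OP, for example, the stream looks like this: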

1 0.75 0.666656 rg
0 0 72 20 re
f
q
1 1 70 18 re
W
n
0 g
BT
/HeBo 12 Tf
0 g
6.696 5.857 Td
(My ) Tj
19.992 0 Td
(Button) Tj
ET
Q
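The button text "My Button" is already visible here, but some parsing is clearly needed to retrieve it (all the more so when the text encoding is not ASCII-derived, as it happens to be in this case). In other words, text extraction has to be applied to the stream.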
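
To make concrete why ad hoc parsing is fragile, here is a deliberately naive sketch that uses only the Java standard library. The class name NaiveTjExtractor is hypothetical. It merely collects literal string operands that directly precede a Tj operator, which happens to be enough for the sample stream above. It does not handle hex strings, TJ arrays, the ' and " operators, escape sequences, or font encodings, so it should not be mistaken for real text extraction.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveTjExtractor {

    // A literal string "( ... )" followed by the Tj operator. Escaped
    // characters are skipped over but NOT unescaped; nested unescaped
    // parentheses are not supported.
    private static final Pattern LITERAL_TJ =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*Tj\\b");

    /** Concatenates the raw operands of all "(...) Tj" instructions in stream order. */
    public static String extract(String contentStream) {
        StringBuilder result = new StringBuilder();
        Matcher matcher = LITERAL_TJ.matcher(contentStream);
        while (matcher.find()) {
            result.append(matcher.group(1));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The text object from the sample appearance stream.
        String stream = String.join("\n",
                "BT",
                "/HeBo 12 Tf",
                "0 g",
                "6.696 5.857 Td",
                "(My ) Tj",
                "19.992 0 Td",
                "(Button) Tj",
                "ET");
        System.out.println(extract(stream)); // prints "My Button"
    }
}
```

The operand bytes are only meaningful relative to the current font (here /HeBo), so treating them as characters works only while the font encoding is ASCII-compatible. That is the reason for driving a proper content stream engine instead.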

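Unfortunately, the main text extraction workhorse in PDFBox, the PDFTextStripper class, is difficult to apply to anything other than page content. I therefore use a base class from which the text stripper is derived, add only minimal text-collecting functionality, and apply it to the button appearance stream: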

import java.io.IOException;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.contentstream.operator.text.BeginText;
import org.apache.pdfbox.contentstream.operator.text.EndText;
import org.apache.pdfbox.contentstream.operator.text.MoveText;
import org.apache.pdfbox.contentstream.operator.text.MoveTextSetLeading;
import org.apache.pdfbox.contentstream.operator.text.NextLine;
import org.apache.pdfbox.contentstream.operator.text.SetCharSpacing;
import org.apache.pdfbox.contentstream.operator.text.SetFontAndSize;
import org.apache.pdfbox.contentstream.operator.text.SetTextHorizontalScaling;
import org.apache.pdfbox.contentstream.operator.text.SetTextLeading;
import org.apache.pdfbox.contentstream.operator.text.SetTextRenderingMode;
import org.apache.pdfbox.contentstream.operator.text.SetTextRise;
import org.apache.pdfbox.contentstream.operator.text.SetWordSpacing;
import org.apache.pdfbox.contentstream.operator.text.ShowText;
import org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted;
import org.apache.pdfbox.contentstream.operator.text.ShowTextLine;
import org.apache.pdfbox.contentstream.operator.text.ShowTextLineAndSpace;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

public class SimpleXObjectTextStripper extends PDFStreamEngine {
    public SimpleXObjectTextStripper() {
        addOperator(new BeginText());
        addOperator(new Concatenate());
        addOperator(new DrawObject()); // special text version
        addOperator(new EndText());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new NextLine());
        addOperator(new SetCharSpacing());
        addOperator(new MoveText());
        addOperator(new MoveTextSetLeading());
        addOperator(new SetFontAndSize());
        addOperator(new ShowText());
        addOperator(new ShowTextAdjusted());
        addOperator(new SetTextLeading());
        addOperator(new SetMatrix());
        addOperator(new SetTextRenderingMode());
        addOperator(new SetTextRise());
        addOperator(new SetWordSpacing());
        addOperator(new SetTextHorizontalScaling());
        addOperator(new ShowTextLine());
        addOperator(new ShowTextLineAndSpace());
    }

    public String getText(PDFormXObject form) throws IOException {
        stringBuilder.setLength(0);

        processChildStream(form, new PDPage()); 

        return stringBuilder.toString();
    }

    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        stringBuilder.append(unicode);
    }

    final StringBuilder stringBuilder = new StringBuilder();
}

(I have included the import statements because PDFBox contains several similarly named classes here.)

With this simple custom stripper class, the text content of the field appearances can be extracted like this:

public void showNormalFieldAppearanceTexts(PDDocument document) throws IOException {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();

    if (acroForm != null) {
        SimpleXObjectTextStripper stripper = new SimpleXObjectTextStripper();

        for (PDField field : acroForm.getFieldTree()) {
            if (field instanceof PDTerminalField) {
                PDTerminalField terminalField = (PDTerminalField) field;
                System.out.println();
                System.out.println("* " + terminalField.getFullyQualifiedName());
                for (PDAnnotationWidget widget : terminalField.getWidgets()) {
                    PDAppearanceDictionary appearance = widget.getAppearance();
                    if (appearance != null) {
                        PDAppearanceEntry normal = appearance.getNormalAppearance();
                        if (normal != null) {
                            Map<COSName, PDAppearanceStream> streams = normal.isSubDictionary() ? normal.getSubDictionary() :
                                Collections.singletonMap(COSName.DEFAULT, normal.getAppearanceStream());
                            for (Map.Entry<COSName, PDAppearanceStream> entry : streams.entrySet()) {
                                String text = stripper.getText(entry.getValue());
                                System.out.printf("  * %s: %s\n", entry.getKey().getName(), text);
                            }
                        }
                    }
                }
            }
        }
    }
}