How to detect artificial bold, artificial italic, and artificial outline text styles using PDFBox

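I am using PDFBox to validate PDF documents. One requirement is to check for the following kinds of text in a PDF:

Artificial bold text
Artificial italic text
Artificial outline text

I searched the PDFBox API listing but could not find an API for this.

Can anyone help me and explain how to use PDFBox to determine the different kinds of artificial font/text styles present in a PDF?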

General procedure and a PDFBox issue

In theory, one should start by deriving a class from PDFTextStripper and overriding its method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

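Your override should then use the List<TextPosition> textPositions instead of the String text; each TextPosition essentially represents a single letter together with the information on the graphics state that was active when that letter was drawn.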

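Unfortunately, the textPositions list does not contain the correct contents in the current version 1.8.3. For example, for the line "This is normal text." from a PDF, the method writeString is called four times, once each for the strings "This", "is", "normal", and "text". Unfortunately, the textPositions list each time contains the TextPosition instances for the letters of the last string, "text".

This turned out to be an acknowledged PDFBox issue, which has since been fixed for versions 1.8.4 and 2.0.0.

That being said, as soon as you have a PDFBox version with the fix, you can check for some artificial styles as follows: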

Artificial italic text

This text style is created like this in the page content:

BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...

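The relevant part is the text matrix set by Tm: its third operand, 5.10137 here, is the factor by which the text is sheared (the proof of concept further down reads it via textPosition.getTextPos().getValue(1, 0)). If this value is significantly greater than 0.0, you have artificial italics. If it is significantly less than 0.0, you have artificial backward italics.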
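What counts as "significantly" different from 0.0 is left open above. The following is a minimal sketch of one way to decide, not part of PDFBox: it divides the shear by the vertical scale (the fourth Tm operand) to get a slant angle and compares that against a threshold. The class name ShearClassifier, the normalization by the vertical scale, and the 1 degree threshold are all assumptions made for illustration.

```java
/** Sketch: classifies the slant of a glyph from two entries of its text matrix. */
final class ShearClassifier {

    enum Slant { UPRIGHT, ITALIC, BACKWARD_ITALIC }

    /** Hypothetical threshold: slants of at most 1 degree are treated as upright. */
    static final double MIN_SLANT_DEGREES = 1.0;

    /**
     * @param shear  third Tm operand (5.10137 in the example above)
     * @param yScale fourth Tm operand (24 in the example above)
     * @return the slant angle in degrees; positive values lean to the right
     */
    static double slantDegrees(double shear, double yScale) {
        if (yScale == 0.0) {
            return 0.0; // degenerate matrix, nothing sensible to report
        }
        return Math.toDegrees(Math.atan(shear / yScale));
    }

    static Slant classify(double shear, double yScale) {
        double degrees = slantDegrees(shear, yScale);
        if (degrees > MIN_SLANT_DEGREES) {
            return Slant.ITALIC;
        }
        if (degrees < -MIN_SLANT_DEGREES) {
            return Slant.BACKWARD_ITALIC;
        }
        return Slant.UPRIGHT;
    }

    public static void main(String[] args) {
        // Values taken from the Tm operator shown above: 24 0 5.10137 24 ...
        System.out.println(classify(5.10137, 24));  // slant of roughly 12 degrees
        System.out.println(classify(0.0, 24));
        System.out.println(classify(-5.10137, 24));
    }
}
```

Working with an angle rather than the raw matrix entry keeps the threshold independent of the font size, since the shear entry grows with the scale of the text.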

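Artificial bold or outline text

These artificial styles print each letter twice using different rendering modes; e.g. the capital letter "T", in the bold case: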

0 0 0 1 k
...
BT
/F0 1 Tf 
24 0 0 24 66.36 729.86 Tm 
<03>Tj 
4 M 0.72 w 
0 0 Td 
1 Tr 
0 0 0 1 K
<03>Tj
ET

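Here the letter is first drawn filled and then drawn again at the same position (0 0 Td) in stroke mode (1 Tr) with a line width of 0.72, both times in black (CMYK 0 0 0 1), which makes it look thicker. The colors therefore matter when telling these styles apart. In the text-stripping context, however, the color-setting operators (in particular those for CMYK colors, which the present document uses) are ignored! Fortunately, the implementations of these operators written for PageDrawer can be used in this context, too.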
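For reference, the rendering mode set by the Tr operator is an integer from 0 to 7 as defined in the PDF specification. The small helper below is only an illustrative sketch (the class name TextRenderingModes is made up and nothing in it is PDFBox API); it spells out which modes fill and which stroke, which is the distinction the double-printing trick relies on.

```java
/** Sketch: meaning of the PDF text rendering modes (operand of the Tr operator). */
final class TextRenderingModes {

    /** Modes 0, 2, 4 and 6 paint the interior of the glyph. */
    static boolean fills(int mode) {
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    /** Modes 1, 2, 5 and 6 paint the outline of the glyph. */
    static boolean strokes(int mode) {
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    static String describe(int mode) {
        switch (mode) {
            case 0: return "fill";
            case 1: return "stroke";
            case 2: return "fill, then stroke";
            case 3: return "neither fill nor stroke (invisible)";
            case 4: return "fill and add to clipping path";
            case 5: return "stroke and add to clipping path";
            case 6: return "fill, then stroke, and add to clipping path";
            case 7: return "add to clipping path";
            default: throw new IllegalArgumentException("not a rendering mode: " + mode);
        }
    }

    public static void main(String[] args) {
        // The bold example above prints the glyph once filled, then once with 1 Tr.
        System.out.println("1 Tr means: " + describe(1));
    }
}
```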

The following proof of concept therefore shows how to retrieve all the required information:

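import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Proof of concept for PDFBox 1.8.x: records rendering mode and colors per glyph.
public class TextWithStateStripperSimple extends PDFTextStripper
{
    public TextWithStateStripperSimple() throws IOException {
        super();
        // Keep double-printed glyphs; they are what reveals artificial bold and outline.
        setSuppressDuplicateOverlappingText(false);
        // The CMYK color operators are ignored by the stripper unless registered here.
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }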

    // Called once per glyph: remember the graphics state active while it is drawn.
    @Override
    protected void processTextPosition(TextPosition text)
    {
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

        super.processTextPosition(text);
    }

    // State captured per glyph, looked up again in writeString.
    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    // Writes the text chunk followed by one diagnostic line per glyph.
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');

        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";
        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }

        return builder.toString();
    }
}

For artificial bold text you get:

. - shear by 0.0 - 378.86 122.140015 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

For artificial italics:

. - shear by 5.10137 - 327.121 156.4123 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

For artificial outline:

. - shear by 0.0 - 357.25 190.25 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0

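So there you have all the information needed to recognize these artificial styles. Now you only have to analyze the data.

By the way, have a look at the artificial bold case: the coordinates may not always be identical, only very similar. Some leniency is therefore required when testing whether two text position objects describe the same position.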
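The analysis step could be sketched as follows. This is only an illustration under stated assumptions, independent of PDFBox: the Glyph class is a hypothetical stand-in for the per-glyph values printed above, the 0.01 position tolerance is an arbitrary choice, and the rule "fill color equals stroke color means bold, otherwise outline" is a heuristic derived from the two sample outputs, not a general definition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: finds double-printed glyphs and labels them as bold or outline. */
final class ArtificialStyleDetector {

    enum Style { BOLD, OUTLINE }

    /** Hypothetical holder for the values printed per glyph above. */
    static final class Glyph {
        final String text;
        final float x, y;
        final int renderingMode;
        final float[] stroking, nonStroking;

        Glyph(String text, float x, float y, int renderingMode,
              float[] stroking, float[] nonStroking) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
            this.stroking = stroking;
            this.nonStroking = nonStroking;
        }
    }

    /** Assumed tolerance; coordinates of a double print may differ slightly. */
    static final float POSITION_TOLERANCE = 0.01f;

    static boolean samePosition(Glyph a, Glyph b) {
        return Math.abs(a.x - b.x) <= POSITION_TOLERANCE
            && Math.abs(a.y - b.y) <= POSITION_TOLERANCE;
    }

    /** Looks for a filled glyph directly followed by the same glyph stroked in place. */
    static List<Style> detect(List<Glyph> glyphs) {
        List<Style> found = new ArrayList<Style>();
        for (int i = 0; i + 1 < glyphs.size(); i++) {
            Glyph fill = glyphs.get(i);
            Glyph stroke = glyphs.get(i + 1);
            boolean doublePrinted = fill.text.equals(stroke.text)
                && fill.renderingMode == 0
                && stroke.renderingMode == 1
                && samePosition(fill, stroke);
            if (!doublePrinted) {
                continue;
            }
            // Heuristic: same color for interior and border reads as bold.
            boolean sameColor = Arrays.equals(fill.nonStroking, stroke.stroking);
            found.add(sameColor ? Style.BOLD : Style.OUTLINE);
            i++; // the stroked copy has been consumed as part of this pair
        }
        return found;
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f};
        List<Glyph> bold = Arrays.asList(
            new Glyph(".", 378.86f, 122.140015f, 0, black, black),
            new Glyph(".", 378.86002f, 122.140015f, 1, black, black));
        System.out.println(detect(bold));
    }
}
```

Comparing only neighbouring glyphs keeps the sketch short; a real check might have to search a wider window if the two prints of a letter are not adjacent in the content stream.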

My solution to this problem was to create a new class that extends the PDFTextStripper class and overrides the function getCharactersByArticle().

Note: PDFBox version 1.8.5

CustomPDFTextStripper class

import java.io.IOException;
import java.util.List;
import java.util.Vector;

import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class CustomPDFTextStripper extends PDFTextStripper {
    public CustomPDFTextStripper() throws IOException {
        super();
    }

    // Made public so callers can read the text positions collected by writeText().
    public Vector<List<TextPosition>> getCharactersByArticle() {
        return charactersByArticle;
    }
}

This way I can parse the PDF document and then get the TextPosition objects from the custom extraction:

// "pdf" is a field of the surrounding class that identifies the input file.
private void extractTextPosition() throws FileNotFoundException, IOException {
    PDFParser parser = new PDFParser(new FileInputStream(pdf));
    parser.parse();
    PDDocument document = parser.getPDDocument();
    try {
        StringWriter outString = new StringWriter();
        CustomPDFTextStripper stripper = new CustomPDFTextStripper();
        stripper.writeText(document, outString);
        // One list of text positions per article.
        for (List<TextPosition> tplist : stripper.getCharactersByArticle()) {
            for (TextPosition text : tplist) {
                System.out.println(" String "
                    + "[x: " + text.getXDirAdj() + ", y: "
                    + text.getY() + ", height:" + text.getHeightDir()
                    + ", space: " + text.getWidthOfSpace() + ", width: "
                    + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                    + text.getCharacter());
            }
        }
    } finally {
        document.close(); // release the parsed document
    }
}