Java PDFBox: Removing invisible text (clip/fill path issue)


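In the sample PDF, many of the labels on the left are clipped (because of some clipping instructions).

When I use PDFTextStripper, it prints all of the text in the sample PDF, including text that is actually clipped or hidden. I have already tried a previously described solution, but it makes things even worse: it removes a lot of text at the top, plus some text at the beginning of each line. Is there another way to make PDFBox return only the visible characters and skip everything that is covered? Or is there some other tool that returns only the visible text? Thanks in advance.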

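The reason the approach referenced by the OP does not work here is that the overridden processTextPosition does not take page rotation into account when it calculates the end of a character's baseline. If the method is changed to test only the start of each character's baseline and to ignore the end, it works very well for the document at hand: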

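@Override
protected void processTextPosition(TextPosition text) {
    // Start of the character's baseline: the text-space origin mapped through the text matrix.
    Matrix textMatrix = text.getTextMatrix();
    Vector start = textMatrix.transform(new Vector(0, 0));

    // Keep the character only if there is no clip path or the baseline start lies inside it.
    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    if (area == null || area.contains(start.getX(), start.getY()))
        super.processTextPosition(text);
}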
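
The visibility test above boils down to a point-in-area check with java.awt.geom.Area. The following stand-alone sketch isolates just that check, without PDFBox. The class name ClipStartPointSketch, the helper startIsVisible, and the rectangle coordinates are made up for illustration and are not part of the original answer.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical helper class, for illustration only.
class ClipStartPointSketch {

    /** Returns true if there is no clip area or the baseline start point lies inside it. */
    static boolean startIsVisible(Area clip, double startX, double startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // Assumed clip region: everything left of x = 100 is clipped away.
        Area clip = new Area(new Rectangle2D.Double(100, 0, 400, 800));

        System.out.println(startIsVisible(clip, 150, 50)); // prints "true": start is inside the clip
        System.out.println(startIsVisible(clip, 20, 50));  // prints "false": start is left of the clip
        System.out.println(startIsVisible(null, 20, 50));  // prints "true": no clip means nothing is hidden
    }
}
```

Because only the start point is tested, a character whose baseline begins inside the clip area but extends beyond it is still kept. For the document discussed here that trade-off appears to be acceptable.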

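With this processTextPosition override (and SortByPosition set to true), the only visible text that appears to be missing from the extraction result at first glance is the total page count in the footers of the two pages.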

As the OP remarked in a comment, "it seems the same should be applied in deleteCharsInPath()". Indeed, deleteCharsInPath should also be changed accordingly:

void deleteCharsInPath() {
    for (List<TextPosition> list : charactersByArticle) {
        // Collect the characters whose baseline start lies inside the path being filled.
        List<TextPosition> toRemove = new ArrayList<>();
        for (TextPosition text : list) {
            Matrix textMatrix = text.getTextMatrix();
            Vector start = textMatrix.transform(new Vector(0, 0));
            if (linePath.contains(start.getX(), start.getY())) {
                toRemove.add(text);
            }
        }
        if (!toRemove.isEmpty()) {
            System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
            list.removeAll(toRemove);
        }
    }
}

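For the second example file, whose page box does not have its lower-left corner at the origin, processTextPosition and deleteCharsInPath additionally need to take the lower-left coordinates of the crop box (lowerLeftX and lowerLeftY) into account: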

@Override
protected void processTextPosition(TextPosition text) {
    Matrix textMatrix = text.getTextMatrix();
    Vector start = textMatrix.transform(new Vector(0, 0));

    // Shift the baseline start by the crop box's lower-left corner before testing it against the clip path.
    PDGraphicsState gs = getGraphicsState();
    Area area = gs.getCurrentClippingPath();
    if (area == null || area.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY()))
        super.processTextPosition(text);
}

[...]

void deleteCharsInPath() {
    for (List<TextPosition> list : charactersByArticle) {
        List<TextPosition> toRemove = new ArrayList<>();
        for (TextPosition text : list) {
            Matrix textMatrix = text.getTextMatrix();
            Vector start = textMatrix.transform(new Vector(0, 0));
            // Apply the same crop box shift before testing against the filled path.
            if (linePath.contains(lowerLeftX + start.getX(), lowerLeftY + start.getY())) {
                toRemove.add(text);
            }
        }
        if (!toRemove.isEmpty()) {
            System.out.println("Removed " + toRemove.size() + " TextPosition objects as they are being covered.");
            list.removeAll(toRemove);
        }
    }
}


Now the extraction result for the new file is fine, too. ;)

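Thank you very much for the quick response, it works well. It seems the same should be applied in deleteCharsInPath(), where it handles fills. By the way, for some reason the footers don't even match the condition "area.contains(start.getX(), start.getY())". In this case that can be ignored, but the reason would be interesting.

For example, in this case the condition fails for a lot of the text at the top. Might it be necessary to add more classes and some extra instruction handling to the PDFTextStripper subclass?

I have not analyzed test2.pdf yet. As background, though: the PDFBox PDFTextStripper normalizes coordinates in a number of ways, and the additions in PDFVisibleTextStripper most likely do not emulate that in every respect. I would not be surprised if that were the root of these issues...

I have only had a first look. Indeed, a page box whose lower-left corner is not at the origin is used. That implies another "normalization" which my path handling does not (yet) emulate. I'll try to fix this when I have time.

That would be great, thanks a lot! At least I have one more thing to look at.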

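The lowerLeftX and lowerLeftY values used in the overrides above are taken from the page's crop box in a processPage override:

// Lower-left corner of the current page's crop box.
float lowerLeftX = 0;
float lowerLeftY = 0;

@Override
public void processPage(PDPage page) throws IOException {
    PDRectangle pageSize = page.getCropBox();

    // Remember the crop box origin before the page content is processed.
    lowerLeftX = pageSize.getLowerLeftX();
    lowerLeftY = pageSize.getLowerLeftY();

    super.processPage(page);
}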
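
To see why the lowerLeftX/lowerLeftY shift matters, here is a small stand-alone sketch using only java.awt.geom. It assumes, as the fix above implies, that the baseline start is expressed relative to the crop box's lower-left corner while the clip path is not shifted. The class name CropBoxOffsetSketch, the helper, and all coordinates are hypothetical.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical helper class, for illustration only.
class CropBoxOffsetSketch {

    /** Tests the baseline start against the clip after shifting it by the crop box's lower-left corner. */
    static boolean startIsVisible(Area clip, double lowerLeftX, double lowerLeftY,
                                  double startX, double startY) {
        return clip == null || clip.contains(lowerLeftX + startX, lowerLeftY + startY);
    }

    public static void main(String[] args) {
        // Assumed crop box with its lower-left corner at (50, 100) instead of the origin.
        double lowerLeftX = 50;
        double lowerLeftY = 100;

        // Assumed clip area in unshifted coordinates: x in [50, 250], y in [100, 400].
        Area clip = new Area(new Rectangle2D.Double(50, 100, 200, 300));

        // Baseline start (10, 10), relative to the crop box corner.
        System.out.println(startIsVisible(clip, lowerLeftX, lowerLeftY, 10, 10)); // prints "true"
        // Without the shift, the same point appears to be outside the clip.
        System.out.println(startIsVisible(clip, 0, 0, 10, 10));                   // prints "false"
    }
}
```

When the crop box starts at the origin both offsets are zero, so the shifted test gives the same result as the earlier version.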