Java PDFBox: Preserving the PDF structure when extracting text


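I am trying to extract text from a PDF that is full of tables. In some cases a column is empty. When I extract the text from the PDF, the empty columns are skipped and replaced by a single whitespace, so my regular expressions cannot tell that a column without any information was present at that position.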

A screenshot in the original post illustrated the problem: the columns are not respected in the extracted text.

Code sample that extracts the text from the PDF:

PDFTextStripper reader = new PDFTextStripper();
reader.setSortByPosition(true);
reader.setStartPage(page);
reader.setEndPage(page);
String st = reader.getText(document);
List<String> lines = Arrays.asList(st.split(System.getProperty("line.separator")));


How can I keep the complete structure of the original PDF when extracting its text?

Thanks a lot.

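(This was originally an answer to another question, which the OP deleted together with all its answers. Because of its age, the code in the answer is still based on PDFBox 1.8.x, so some changes may be needed to make it run with PDFBox 2.0.x.)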

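In the comments, the OP showed interest in a solution that extends the PDFBox PDFTextStripper so that it returns text lines which attempt to reflect the layout of the PDF file, which might help with the problem at hand.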

A proof of concept for this is the following class:

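// Imports for PDFBox 1.8.x, where the text extraction classes live in org.apache.pdfbox.util
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class LayoutTextStripper extends PDFTextStripper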
{
    public LayoutTextStripper() throws IOException
    {
        super();
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        super.startPage(page);
        cropBox = page.findCropBox();
        pageLeft = cropBox.getLowerLeftX();
        beginLine();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        float recentEnd = 0;
        for (TextPosition textPosition: textPositions)
        {
            String textHere = textPosition.getCharacter();
            if (textHere.trim().length() == 0)
                continue;

            float start = textPosition.getTextPos().getXPosition();
            boolean spacePresent = endsWithWS | textHere.startsWith(" ");

            if (needsWS | spacePresent | Math.abs(start - recentEnd) > 1)
            {
                int spacesToInsert = insertSpaces(chars, start, needsWS & !spacePresent);

                for (; spacesToInsert > 0; spacesToInsert--)
                {
                    writeString(" ");
                    chars++;
                }
            }

            writeString(textHere);
            chars += textHere.length();

            needsWS = false;
            endsWithWS = textHere.endsWith(" ");
            try
            {
                recentEnd = getEndX(textPosition);
            }
            catch (IllegalArgumentException | IllegalAccessException | NoSuchFieldException | SecurityException e)
            {
                throw new IOException("Failure retrieving endX of TextPosition", e);
            }
        }
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        beginLine();
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        needsWS = true;
    }

    void beginLine()
    {
        endsWithWS = true;
        needsWS = false;
        chars = 0;
    }

    int insertSpaces(int charsInLineAlready, float chunkStart, boolean spaceRequired)
    {
        int indexNow = charsInLineAlready;
        int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
        int spacesToInsert = indexToBe - indexNow;
        if (spacesToInsert < 1 && spaceRequired)
            spacesToInsert = 1;

        return spacesToInsert;
    }

    float getEndX(TextPosition textPosition) throws IllegalArgumentException, IllegalAccessException, NoSuchFieldException, SecurityException
    {
        Field field = textPosition.getClass().getDeclaredField("endX");
        field.setAccessible(true);
        return field.getFloat(textPosition);
    }

    public float fixedCharWidth = 3;

    boolean endsWithWS = true;
    boolean needsWS = false;
    int chars = 0;

    PDRectangle cropBox = null;
    float pageLeft = 0;
}

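fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question, a different value may be more appropriate. In my sample documents, values in the range 3..6 were of interest.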
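To make the role of fixedCharWidth more concrete, here is a minimal, stand-alone sketch of the same column arithmetic as insertSpaces above. It uses no PDFBox types: the Chunk class, the renderLine helper, and the sample coordinates are hypothetical and exist only for illustration. It also assumes the chunks arrive sorted from left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone illustration of the column arithmetic; no PDFBox types are involved. */
final class FixedWidthLayout {

    /** Hypothetical stand-in for a positioned piece of text (x in PDF user space units). */
    static final class Chunk {
        final float x;
        final String text;

        Chunk(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /** Mirrors insertSpaces: number of blanks needed so the chunk starts at its target column. */
    static int spacesToInsert(int charsInLineAlready, float chunkStart, float pageLeft,
                              float fixedCharWidth, boolean spaceRequired) {
        int indexToBe = (int) ((chunkStart - pageLeft) / fixedCharWidth);
        int spaces = indexToBe - charsInLineAlready;
        if (spaces < 1 && spaceRequired) {
            spaces = 1; // the line is already past the target column, keep one separating blank
        }
        return spaces;
    }

    /** Renders chunks (assumed sorted left to right) into a single line of text. */
    static String renderLine(List<Chunk> chunks, float pageLeft, float fixedCharWidth) {
        StringBuilder line = new StringBuilder();
        for (Chunk chunk : chunks) {
            // In this sketch every chunk after the first is separated by at least one blank.
            boolean spaceRequired = line.length() > 0;
            int spaces = spacesToInsert(line.length(), chunk.x, pageLeft, fixedCharWidth, spaceRequired);
            for (int i = 0; i < spaces; i++) {
                line.append(' ');
            }
            line.append(chunk.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Chunk> row = new ArrayList<>();
        row.add(new Chunk(0f, "Smith"));
        row.add(new Chunk(120f, "Berlin")); // a column in between is empty
        // With a character width of 6, x = 120 maps to column index 20.
        System.out.println("[" + renderLine(row, 0f, 6f) + "]");
    }
}
```

A smaller fixedCharWidth spreads the same x coordinates over more character columns, so gaps become wider in the output and neighbouring chunks are less likely to collide; a larger value compresses the line.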

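It essentially emulates an analogous solution for iText. The results differ slightly, though, because iText text extraction forwards chunks of text while PDFBox text extraction forwards individual characters.

Please note that this is merely a proof of concept. In particular, it does not take any rotation into account.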

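Try a tool like tabula-java, which sits on top of PDFBox. PDFBox itself does not try to recognize tables.

Or, if you are interested in a PDFTextStripper variant that tries to insert extra spaces wherever there are larger gaps in the PDF, I can copy such a variant over here.

@mkl Your solution might help. If the extra spaces that are added are always the same (in terms of character count), it would do the job.

Your solution works very well. It needed some conversion to match the PDFBox version I use, but the first runs are promising. The structure is almost identical to that of the original PDF. Unless a better solution turns up, I will use this one. Thanks a lot.

The solution using LayoutTextStripper is useful for my application. However, text such as "Name and Address" sometimes comes out with some of the spaces between the words missing. I am using PDFBox 2.0.13. What can I do to get it right? (I am using PDFBox for the first time, and the changes I made to get the code running with version 2 may have introduced errors.) Thanks for your advice.

OK, I found a working version for PDFBox 2.x. As mentioned in the answer, it is a proof of concept, so some details may still be rough. I think changing (lowering) the fixedCharWidth value may help with the code above.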

PDDocument document = PDDocument.load(PDF);

LayoutTextStripper stripper = new LayoutTextStripper();
stripper.setSortByPosition(true);
stripper.fixedCharWidth = charWidth; // e.g. 5

String text = stripper.getText(document);
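Once the extracted lines reflect the page layout, the empty columns from the original question can be detected by cutting each line at fixed character offsets instead of relying on whitespace-based regular expressions. The following is only a sketch of that idea: the ColumnSlicer class and the column start offsets are hypothetical and would have to be derived from the actual document and the chosen fixedCharWidth.

```java
import java.util.ArrayList;
import java.util.List;

/** Cuts layout-preserving text lines into cells at fixed character offsets. */
final class ColumnSlicer {

    /**
     * Slices a line at the given ascending start offsets.
     * A column that contains only blanks yields an empty string.
     */
    static List<String> slice(String line, int[] columnStarts) {
        List<String> cells = new ArrayList<>();
        for (int i = 0; i < columnStarts.length; i++) {
            // Clamp the offsets so that short lines do not cause an exception.
            int from = Math.min(columnStarts[i], line.length());
            int to = (i + 1 < columnStarts.length)
                    ? Math.min(columnStarts[i + 1], line.length())
                    : line.length();
            cells.add(line.substring(from, to).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        int[] starts = {0, 10, 20}; // assumed column offsets, for illustration only
        // Builds a sample line in which the middle column consists of blanks only.
        String line = String.format("%-10s%-10s%s", "Smith", "", "Berlin");
        List<String> cells = slice(line, starts);
        System.out.println(cells.get(1).isEmpty() ? "middle column is empty" : cells.get(1));
    }
}
```

Because the stripper only approximates the layout, cell contents can drift by a few characters, so offsets chosen with some slack (or a lower fixedCharWidth) tend to be more forgiving.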