Java: get the line of a PDF file that follows a specific line

java string file-io

I am using Apache PDFBox to parse text from a PDF file. I am trying to get the line that comes right after a specific line.

PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text from pdf:" + text);
} else{
    log.info("File is encrypted!");
}
document.close();

Sample:

Sentence 1, line n of the file

The line I need

Sentence 3, line n+2 of the file

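I tried reading all lines of the file into an array, but that is unreliable because I cannot filter down to a specific piece of text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution.

Solution 1: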

String[] lines = myString.split(System.getProperty("line.separator"));

Solution 2:

String neededline = (String) FileUtils.readLines(file).get("n+2th")

In fact, the PDFTextStripper class uses exactly the same line ending as you do, so your first attempt is about as close to correct as you can get with PDFBox.

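As you can see in the PDFTextStripper source, getText delegates to a method that works exactly the way you already tried, except that it writes the text into an output buffer line by line. What it returns is buffer.toString().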

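So for well-formed PDFs, the question you are really asking seems to be how to filter the array for specific text. Here are some ideas:

First, capture the lines in an array as you described: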

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    static String[] lines;

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("my2.pdf"));
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        lines = text.split(System.getProperty("line.separator"));
        document.close();
    }
}
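The split above relies on the extracted text using the platform line separator. As a minimal sketch, assuming the text is already in a string and that Java 8 or later is available, the regex `\R` splits on any line break instead. The class name `LineSplitSketch` and the sample string are made up for illustration; the string stands in for what `stripper.getText(document)` returns.

```java
import java.util.Arrays;

public class LineSplitSketch {

    // Splits text on any line break (\n, \r\n, \r, ...) using the regex \R.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by stripper.getText(document)
        String text = "Sentence 1\r\nNeeded line\nSentence 3";
        System.out.println(Arrays.toString(splitLines(text)));
        // prints [Sentence 1, Needed line, Sentence 3]
    }
}
```

This does not change what PDFBox extracts; it only makes the split independent of which line separator ended up in the string.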

Here is a simple way to get the full line for any line index:

// returns a full String line by number n
static String getLine(int n) {
    return lines[n];
}

Here is a linear search method that looks for a string match and returns the index of the first line that contains it:

// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
    int n = 0;
    for(String line : lines) {
        if(line.indexOf(filter) != -1) {
            return n;
        }
        n++;
    }
    return -1;
}

With the above, you can get just the line number of the first matching line:

System.out.println(getLineNumberWithFilter("Cat dog mouse")); // -1 if nothing matches

Or the whole line that contains the match:

System.out.println(getLine(8)); // line 8 for example

System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
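Since the question asks for the line after the matching one, the same linear search can return the next element instead. This is only a sketch under the assumption that the lines are already in an array; `LineAfterSketch` and `lineAfter` are hypothetical names, not part of PDFBox. It returns an empty `Optional` rather than throwing when nothing matches or when the match is the last line.

```java
import java.util.Optional;

public class LineAfterSketch {

    // Returns the line following the first line that contains `marker`,
    // or Optional.empty() if no line matches or the match is the last line.
    static Optional<String> lineAfter(String[] lines, String marker) {
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].contains(marker)) {
                return Optional.of(lines[i + 1]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample mirroring the layout in the question
        String[] lines = {"Sentence 1", "Needed line", "Sentence 3"};
        System.out.println(lineAfter(lines, "Sentence 1").orElse("<not found>"));
        // prints Needed line
    }
}
```

Indexing `lines[getLineNumberWithFilter(...)]` directly throws an ArrayIndexOutOfBoundsException when the search returns -1, which is why this sketch wraps the result.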

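All of this looks simple, and it only works on the assumption that the text can be split into an array of lines by the line separator. If the solution is not as simple as the ideas above, I believe the root of the problem is probably not in how you use PDFBox but in the PDF source you are trying to text-mine.

There is also a tutorial that does what you are trying to do; again, it takes the same approach.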
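If the instability comes from the extracted text rather than from the search, one thing to try is normalizing each line before comparing. The sketch below assumes the mismatches are caused by stray, repeated, or non-breaking whitespace; that is an assumption, since the question does not say what exactly goes wrong. `NormalizeSketch`, `normalize` and `indexOfNormalized` are hypothetical names.

```java
public class NormalizeSketch {

    // Replaces non-breaking spaces, collapses whitespace runs to one space, and trims.
    static String normalize(String line) {
        return line.replace('\u00A0', ' ').replaceAll("\\s+", " ").trim();
    }

    // Index of the first line whose normalized form contains the normalized filter, or -1.
    static int indexOfNormalized(String[] lines, String filter) {
        String wanted = normalize(filter);
        for (int i = 0; i < lines.length; i++) {
            if (normalize(lines[i]).contains(wanted)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical lines with irregular spacing, as PDF extraction sometimes produces
        String[] lines = {"Header", "Cat   dog\u00A0mouse ", "Footer"};
        System.out.println(indexOfNormalized(lines, "Cat dog mouse")); // prints 1
    }
}
```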
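"But that is unreliable because I cannot filter down to a specific piece of text" - can you explain what that means? Your solution 1 should work for plainly formatted PDFs produced by editors such as Microsoft Word. It actually uses the same line separator as the PDFBox source code. I suspect there are plenty of odd cases where a PDF has strange formatting and gives you unstable results, but that is out of your hands unless you control how the PDFs you need text from are created.

There is a tutorial on extracting lines from a PDF, but it only works for well-formed PDFs. After following it, you can simply call lines.get(index) to get the line you need.

Try toggling the sort option on the stripper.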