Java iText PdfTextExtractor getTextFromPage exception: "Error reading string at file pointer"

Tags: java, jakarta-ee, itext

I am using iText's PdfTextExtractor to extract text from a PdfReader, where the PdfReader is created from a byte array:

    byte[] pdfbytes = outputStream.toByteArray();

    PdfReader reader = new PdfReader(pdfbytes);

    int pagenumber = reader.getNumberOfPages();
    PdfTextExtractor extractor = new PdfTextExtractor(reader);

    for(int i = 1; i<= pagenumber; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============" );
        String line = extractor.getTextFromPage(i);
        System.out.println(line);
    }
With the first test PDF I get the following exception:

    Exception in thread "main" ExceptionConverter: java.io.IOException: Error reading string at file pointer 238291
        at com.lowagie.text.pdf.PRTokeniser.throwError(Unknown Source)
        at com.lowagie.text.pdf.PRTokeniser.nextToken(Unknown Source)
        at com.lowagie.text.pdf.PdfContentParser.nextValidToken(Unknown Source)
        at com.lowagie.text.pdf.PdfContentParser.readPRObject(Unknown Source)
        at com.lowagie.text.pdf.PdfContentParser.parse(Unknown Source)
        at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
        at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
        at org.xxx.services.pdfparser.xxxExtensionPdfParser.main(xxxExtensionPdfParser.java:114)

where xxxExtensionPdfParser.java:114 is the line String line = extractor.getTextFromPage(i);

With a second test PDF, however, I get the text content without any exception. So I think something in the format of the first PDF must be causing the exception.

So my question is: what is this format problem, and is there a way to avoid it? Thanks.

I got the same error. After investigating, it seems the problem lies in my PDF documents: unlike the IRS document you linked, they contain "headers" or "footers". I indexed a 900-page PDF document, and 70 of its pages could not be extracted. Apparently all of those pages carry copyright information in the footer. Is there any way to work around this?

------ EDIT ------
I used the following approach to get the text out of the PDF mentioned above. I hope it works for you too.

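    // PdfReaderContentParser, TextExtractionStrategy and SimpleTextExtractionStrategy
    // are iText 5 classes (package com.itextpdf.text.pdf.parser).
    // file and currentPage come from the surrounding code, which is not shown.
    PdfReader pdfReader = new PdfReader(file);
    PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);

    TextExtractionStrategy strategy = parser.processContent(currentPage, new SimpleTextExtractionStrategy());
    String content = strategy.getResultantText();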

    // The question's loop, using the static iText 5 method instead.
    byte[] pdfbytes = outputStream.toByteArray();

    PdfReader reader = new PdfReader(pdfbytes);

    int pagenumber = reader.getNumberOfPages();
    // getTextFromPage is static here and takes the reader,
    // so no PdfTextExtractor instance is created.

    for (int i = 1; i <= pagenumber; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============");
        String line = PdfTextExtractor.getTextFromPage(reader, i);
        System.out.println(line);
    }
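
One answer above reports that 70 pages of a 900-page PDF could not be extracted. A minimal sketch of one way to keep a single bad page from aborting the whole run is to extract page by page and record which pages failed. TolerantPageExtractor and PageSource are hypothetical names, not iText classes. In real use the lambda would presumably wrap a call such as PdfTextExtractor.getTextFromPage(reader, page); here a stand-in is used so the example runs without iText. This does not repair the malformed content stream, it only isolates it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of iText: extracts pages one at a time and
// records the pages whose text could not be extracted.
class TolerantPageExtractor {

    // Hypothetical abstraction over the real extraction call, e.g.
    // page -> PdfTextExtractor.getTextFromPage(reader, page) in iText 5.
    interface PageSource {
        String textOf(int page) throws IOException;
    }

    final Map<Integer, String> textByPage = new LinkedHashMap<>();
    final List<Integer> failedPages = new ArrayList<>();

    static TolerantPageExtractor extractAll(int pageCount, PageSource source) {
        TolerantPageExtractor result = new TolerantPageExtractor();
        for (int page = 1; page <= pageCount; page++) { // PDF pages are 1-based
            try {
                result.textByPage.put(page, source.textOf(page));
            } catch (IOException | RuntimeException e) {
                // The trace in the question shows the IOException wrapped in an
                // ExceptionConverter, so unchecked exceptions are caught as well.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in source that fails on page 2, simulating a malformed page.
        TolerantPageExtractor r = extractAll(3, page -> {
            if (page == 2) {
                throw new IOException("Error reading string at file pointer 238291");
            }
            return "text of page " + page;
        });
        System.out.println("extracted: " + r.textByPage.keySet());
        System.out.println("failed: " + r.failedPages);
    }
}
```

The list of failed pages can then be logged or retried with a different extraction strategy, while the text of the remaining pages is still indexed.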