Java PdfBox 2.0 exception


I have this code, which uses PDFBox 2.0 to extract text from scientific papers (PDF):

public class Sectioning {

    private static ArrayList<String> titles = new ArrayList<String>();
    private static ArrayList<Integer> sectionsIndex = new ArrayList<Integer>();
    private static HashMap<String, String> Sections = new HashMap<String,String>();
    private static PDFManager pdfManager = new PDFManager();

    public Sectioning() {
    }

    // Takes the PDF file and passes it to extractText (in the PDFSectionsTitle class) to get the titles in the PDF file
    public  ArrayList<String> GetTitles(File file) throws FileNotFoundException
    {
        FileInputStream fis = new FileInputStream(file);
        titles = extractText(fis);
         
        return titles;        
    }
    
    /* Takes the PDF file, gets its text, then finds the index of each title in that text,
    then passes the text to TextSections to store the titles and their sections in a hashmap */

     public HashMap<String, String> Section(File file) throws IOException
    {
        
        pdfManager.setFilePath(file.getPath()); 

        String text = pdfManager.toText();
        int prevstop = 0;
        
        for (int j = 0 ; j<=titles.size()-1 ; j++)
        {
        prevstop =  text.indexOf(titles.get(j),prevstop);
        sectionsIndex.add(prevstop);
        }
        
        TextSections(text);
       
        return Sections;
    }
    
    //Store in a hashmap the titles with their paragraphs
    public void TextSections(String text) 
    {
    for(int i = 0 ; i <= sectionsIndex.size()-1;i++)
        {
            if(i == sectionsIndex.size()-1) 
            {
               Sections.put(titles.get(i), text.substring(sectionsIndex.get(i)).replaceFirst(titles.get(i), "")); //for last title the paragraph is to the end of the file   
            }
            else
            {
                Sections.put(titles.get(i), text.substring(sectionsIndex.get(i), sectionsIndex.get(i+1)).replaceFirst(titles.get(i), "")); //The paragraphs of the current title ends where the next title exists
            }
        }
    
    }
    
    public void clear() throws IOException {
        titles.clear();
        sectionsIndex.clear();
        Sections.clear();
        pdfManager.closeDoc();
    }
}

Does anyone know why it gives me this error? The files are not corrupted, and I have already extracted their text using other methods!

The exception is in your code, not in PDFBox. Take a close look at Sectioning.java:61, or show us that class.

"No Unicode mapping for summationdisplay" means that these glyphs have no Unicode to extract, but that has nothing to do with your exception.

@PetrJaneček Thanks for your comment, I have added the class.

@Tilmanhausher Thanks for your comment, but I am sure the glyphs are fine, because I did extract the text from them and it worked well!

As for the Unicode, I would have to look at the PDF. Probably most of the text gets extracted, but not the "summationdisplay" glyph. As for the exception: find out why "sectionsIndex.get(i)" is "-6237". Maybe "sectionsIndex" is filled in more places than "sectionsIndex.add(prevstop);", which IMHO can only be >= 0.
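
One possible explanation, offered as a guess rather than a confirmed diagnosis: `String.indexOf` returns -1 when the search string is not found, so `sectionsIndex` can hold a negative entry if a title returned by `extractText` does not appear verbatim in the text returned by `pdfManager.toText()`. A later `substring(begin, end)` call with `end` smaller than `begin` then fails. The sketch below reproduces that pattern on made-up data; `IndexOfDemo`, `findOffsets`, and the sample strings are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexOfDemo {

    // Mirrors the loop in Section(): each search starts at the previous result.
    static List<Integer> findOffsets(String text, List<String> titles) {
        List<Integer> offsets = new ArrayList<>();
        int prevStop = 0;
        for (String title : titles) {
            // indexOf returns -1 when the title is absent; a negative
            // fromIndex on the next call is treated as 0.
            prevStop = text.indexOf(title, prevStop);
            offsets.add(prevStop);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: "Methods" does not occur in the text.
        String text = "Abstract aaa Introduction bbb Conclusion ccc";
        List<String> titles =
                Arrays.asList("Abstract", "Introduction", "Methods", "Conclusion");

        List<Integer> offsets = findOffsets(text, titles);
        System.out.println("offsets = " + offsets);

        try {
            // Same shape as the else-branch in TextSections():
            // the end index is the (negative) offset of the missing title.
            text.substring(offsets.get(1), offsets.get(2));
        } catch (StringIndexOutOfBoundsException e) {
            // The exact message wording depends on the Java version.
            System.out.println("substring failed: " + e.getMessage());
        }
    }
}
```

If this is what happens with the real PDF, printing each title together with its `indexOf` result before calling `TextSections` should show which title is missing from the extracted text.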

 Jul 22, 2020 12:03:01 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for summationdisplay (88) in font UKPOAO+CMEX10
Jul 22, 2020 12:03:03 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for summationdisplay (88) in font UKPOAO+CMEX10
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -6237
    at java.lang.String.substring(String.java:1911)
    at pdfpapersections.Sectioning.TextSections(Sectioning.java:61)
    at pdfpapersections.Sectioning.Section(Sectioning.java:45)
    at pdfpapersections.PDFPaperSections.main(PDFPaperSections.java:46)
Java Result: 1
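
A minimal defensive sketch of the sectioning step, assuming the failure comes from a title that cannot be found in the extracted text. It skips such titles instead of storing a negative index, keeps its state in local variables rather than static fields, and cuts the title off by length instead of using `replaceFirst`, which interprets its first argument as a regular expression. `SectionSplitter` and `splitSections` are hypothetical names, and trimming the section bodies is a choice made here, not something the original code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SectionSplitter {

    // Splits text into sections keyed by title; titles that are not found are skipped.
    static Map<String, String> splitSections(String text, List<String> titles) {
        List<String> found = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        for (String title : titles) {
            int idx = text.indexOf(title, from);
            if (idx < 0) {
                continue; // title missing from the extracted text: ignore it
            }
            found.add(title);
            starts.add(idx);
            from = idx + title.length(); // keep later searches after this title
        }

        // LinkedHashMap preserves the order in which sections appear.
        Map<String, String> sections = new LinkedHashMap<>();
        for (int i = 0; i < found.size(); i++) {
            int bodyStart = starts.get(i) + found.get(i).length();
            // A section ends where the next found title starts, or at the end of the text.
            int bodyEnd = (i + 1 < found.size()) ? starts.get(i + 1) : text.length();
            sections.put(found.get(i), text.substring(bodyStart, bodyEnd).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: "Methods" does not occur in the text.
        String text = "Abstract aaa Introduction bbb Conclusion ccc";
        List<String> titles =
                Arrays.asList("Abstract", "Introduction", "Methods", "Conclusion");
        System.out.println(splitSections(text, titles));
    }
}
```

Because every stored start index is non-negative and each search begins after the previous title, `bodyStart` can never exceed `bodyEnd`, so the `substring` call cannot receive a reversed range.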