Searching for a sentence in a PDF using a Lucene phrase query and PDFBox


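I am using the following code to search for text in a PDF. It works fine for single words. But for a sentence like the one in the code, it reports that the text is not found even though it exists in the document. Can anyone help me solve this?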

    try {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        //Directory directory = FSDirectory.open("/tmp/testindex");
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();

        // Load the PDF and extract its text with PDFBox
        PDDocument document = null;
        try {
            document = PDDocument.load(strFilepath);
        } catch (IOException ex) {
            System.out.println("Exception occurred while loading the document: " + ex);
        }
        String output = new PDFTextStripper().getText(document);

        // Index the extracted text in a single "contents" field
        doc.add(new Field("contents", output, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Query parser (created here but not used below)
        QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);

        // Build a phrase query from the raw words of the sentence
        String sentence = "Following are the";
        PhraseQuery query = new PhraseQuery();
        String[] words = sentence.split(" ");
        for (String word : words) {
            query.add(new Term("contents", word));
        }
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        if (hits.length > 0) {
            System.out.println("Searched text existed in the PDF.");
        }
        ireader.close();
        directory.close();
    } catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }

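You should use the query parser to create the query from your sentence instead of building the phrase query yourself. The query you built by hand contains the term "Following", which is not in the index: the standard analyzer lowercases text at index time, so only "following" is indexed.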
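The mismatch described above can be illustrated with a small JDK-only sketch. It is a simplified stand-in for index-time analysis, not Lucene's actual `StandardAnalyzer`: the `analyze` helper and the sample text "Following are the steps." are assumptions made up for illustration. It only shows that a raw, capitalized query term cannot match a term that was lowercased before being indexed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of index-time analysis; NOT Lucene's StandardAnalyzer.
class LowercaseMismatchSketch {

    // Hypothetical helper: split on non-letter/non-digit characters
    // and lowercase every token, roughly as an analyzer would.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Assumed sample text standing in for the extracted PDF contents.
        List<String> indexed = analyze("Following are the steps.");

        // Raw term as used in the hand-built phrase query: no match.
        System.out.println(indexed.contains("Following")); // false

        // Same word passed through the same analysis: match.
        System.out.println(indexed.contains("following")); // true
    }
}
```

The point of the sketch is that query text has to go through the same analysis as the indexed text, which is what a query parser configured with the indexing analyzer does for you.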

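I have used the query parser, but I am still not getting a match for the complete sentence. Instead, it takes only the first word and reports that it does not exist. I used the following code for the query parser:

    QueryParser queryParser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
    queryParser.setDefaultOperator(QueryParser.Operator.AND);
    queryParser.setPhraseSlop(0);
    Query query = queryParser.createPhraseQuery("contents", sentence);
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;

The standard analyzer filters out stop words, so your query ends up as just contents:following anyway. This really means that the word "following" is not present in the text of your PDF. Can you print out the String "output"? I am sure there is no "following" in it.

Please suggest which analyzer I should use to keep the complete sentence in the query. Is there any way to do this with the standard analyzer?

Yes: pass an empty stop word set to the constructor of the standard analyzer. Then no stop words will be removed. But I think your problem is not the analyzer, it is the content of your PDF.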
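The effect of stop word filtering on the sentence "Following are the" can be sketched without Lucene. This is a simplified model, not Lucene's implementation: the `analyze` helper is hypothetical, and `SOME_STOP_WORDS` is only a small illustrative subset of a typical English stop word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of lowercasing plus stop word removal; NOT Lucene code.
class StopWordSketch {

    // Small illustrative subset of English stop words (an assumption).
    static final Set<String> SOME_STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "are", "is", "of", "the", "to"));

    // Hypothetical helper: tokenize, lowercase, drop configured stop words.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !stopWords.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String sentence = "Following are the";

        // With stop words: only one term survives.
        System.out.println(analyze(sentence, SOME_STOP_WORDS));
        // prints [following]

        // With an empty stop word set: the whole sentence is kept.
        System.out.println(analyze(sentence, Collections.<String>emptySet()));
        // prints [following, are, the]
    }
}
```

In Lucene 4.x the empty-set variant would presumably look like `new StandardAnalyzer(Version.LUCENE_CURRENT, CharArraySet.EMPTY_SET)`; check this against the Lucene version you are using. The same analyzer has to be used both for indexing and for building the query, otherwise the terms will not line up.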