PhraseQuery + Lucene 4.6 not working for PDF word search

pdf lucene

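I am using Lucene 4.6 with a PhraseQuery to search for words in a PDF. My code is below. I can get the text output from the PDF, and the query prints as contents:"Following are the", but the number of hits is 0. Any suggestions? Thanks in advance.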

    try {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        iwriter.deleteAll();
        iwriter.commit();
        Document doc = new Document();
        PDDocument document = null;
        try {
            document = PDDocument.load(strFilepath);
        }
        catch (IOException ex) {
            System.out.println("Exception occurred while loading the document: " + ex);
        }
        String output = new PDFTextStripper().getText(document);
        System.out.println(output);
        //String text = "This is the text to be indexed";
        doc.add(new Field("contents", output, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        String sentence = "Following are the";
        //IndexSearcher searcher = new IndexSearcher(directory);
        if (output.contains(sentence)) {
            System.out.println("");
        }

        // Build the phrase from the raw words of the sentence
        PhraseQuery query = new PhraseQuery();
        String[] words = sentence.split(" ");
        for (String word : words) {
            query.add(new Term("contents", word));
        }

        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        // Check the results:
        if (hits.length > 0) {
            System.out.println("Searched text existed in the PDF.");
        }
        ireader.close();
        directory.close();
    }
    catch (Exception e) {
        System.out.println("Exception: " + e.getMessage());
    }

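Your PhraseQuery is not working for two reasons.

First, StandardAnalyzer uses ENGLISH_STOP_WORDS_SET, which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with. These words are removed from the TokenStream at index time. That means when you search the index for "Following are the", "are" and "the" will not be found. So you will never get any result for this PhraseQuery, because "are" and "the" were never in the index to search against. The solution is to use this constructor when indexing:

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET);

This ensures that no words are removed from the TokenStream while indexing.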
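To make the stop word effect concrete, here is a minimal plain-Java simulation. It is not Lucene's own code: the class `StopWordSimulation` and its methods `index` and `phraseMatches` are hypothetical names, whitespace splitting stands in for the real tokenizer, and the sample text "Following are the contents" is invented for illustration. It assumes, as in Lucene 4.x, that a removed stop word leaves a gap in the token positions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class StopWordSimulation {

    // Default English stop words, as listed in the answer above.
    static final Set<String> ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
        "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    /** Returns position -> term for every token that survives stop word removal. */
    static Map<Integer, String> index(String text, Set<String> stopWords) {
        Map<Integer, String> postings = new LinkedHashMap<>();
        // Lowercase so that only the stop word effect is visible in this sketch.
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (!stopWords.contains(tokens[pos])) {
                postings.put(pos, tokens[pos]); // a removed word leaves a position gap
            }
        }
        return postings;
    }

    /** True if the terms occur at consecutive positions, as an exact phrase requires. */
    static boolean phraseMatches(Map<Integer, String> postings, String... terms) {
        for (int start : postings.keySet()) {
            boolean allMatch = true;
            for (int i = 0; i < terms.length; i++) {
                if (!terms[i].equals(postings.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Following are the contents";
        Map<Integer, String> withStops = index(text, ENGLISH_STOP_WORDS);
        Map<Integer, String> withoutStops = index(text, Collections.<String>emptySet());

        System.out.println(withStops);    // {0=following, 3=contents}
        System.out.println(withoutStops); // {0=following, 1=are, 2=the, 3=contents}
        System.out.println(phraseMatches(withStops, "following", "are", "the"));    // false
        System.out.println(phraseMatches(withoutStops, "following", "are", "the")); // true
    }
}
```

With the default stop set, "are" and "the" never reach the index, so no run of consecutive positions can satisfy the phrase. With an empty stop set, all three terms are present at positions 0, 1, and 2.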

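Second, StandardAnalyzer also uses LowerCaseFilter, which means all tokens are normalized to lowercase. So "Following" is indexed as "following", and searching for "Following" gives no result. Calling .toLowerCase() on your sentence fixes this, and the search should then return results.

Also have a look at Unicode Standard Annex #29, whose word-break rules StandardTokenizer follows. In brief, apostrophes, quotation marks, full stops, small commas, and many other characters are ignored under certain conditions at index time.

Are you sure you pasted the right code? Your sentence is "2.3", which you split(" ") and then use as the PhraseQuery parameter. That doesn't make much sense.

Sorry for the confusion. The sentence here is "Following are the".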
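Returning to the lowercase point in the answer, the query side can be sketched in plain Java as below. The class `QueryTerms` and the method `toQueryTerms` are hypothetical helpers, not part of Lucene. The sketch only lowercases and splits on whitespace, so it assumes the sentence contains no punctuation that the tokenizer would treat differently at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class QueryTerms {

    /** Lowercases the sentence and splits it on whitespace, one term per word. */
    static List<String> toQueryTerms(String sentence) {
        List<String> terms = new ArrayList<>();
        for (String word : sentence.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) { // an empty sentence yields no terms
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(toQueryTerms("Following are the")); // [following, are, the]
    }
}
```

Each returned word would then go into the question's loop as `query.add(new Term("contents", word))`, so that the query terms have the same case as the indexed tokens.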