Indexing file paths or URIs in Lucene

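Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents when their query terms contain a segment of the path or URI.

For example, if the path is

C:\home\user\research\whitepapers\analysis\detail.txt

I'd like a user to be able to find it by querying for path:whitepapers.

Likewise, if the URI is

http://www.stackoverflow.com/questions/ask

a query containing uri:questions should retrieve it.

Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Do I need to preprocess these fields in any way, for example by replacing forward slashes or backslashes with spaces?

Suggestions are welcome.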
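The preprocessing idea raised in the question (replacing forward slashes and backslashes with spaces before indexing) could look roughly like the sketch below. `PathPreprocessor` and `separatorsToSpaces` are hypothetical names introduced for illustration; they are not part of Lucene, and whether this step is needed at all depends on the analyzer in use.

```java
// Hypothetical helper, not part of Lucene: normalizes path/URI separators
// so that a whitespace-based tokenizer would see each segment as its own word.
final class PathPreprocessor {
    private PathPreprocessor() {}

    /** Replaces every run of '/' or '\' with a single space and trims the ends. */
    static String separatorsToSpaces(String pathOrUri) {
        // Regex character class [\\/] matches a backslash or a forward slash.
        return pathOrUri.replaceAll("[\\\\/]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(separatorsToSpaces(
                "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt"));
        System.out.println(separatorsToSpaces(
                "http://www.stackoverflow.com/questions/ask"));
    }
}
```

Note that the drive letter and the URI scheme keep their trailing colon ("C:" and "http:"), so further cleanup may still be wanted depending on how the field is analyzed.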

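You can use StandardAnalyzer. I tested this by adding the following test function to Lucene:

public void testBackslashes() throws Exception {
  // `a` is assumed to be the StandardAnalyzer instance under test in the enclosing test class
  assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt", new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
  assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask", new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}

This unit test passed with Lucene 2.9.1. You may want to try it with your specific Lucene distribution. I think it does what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?

Thanks! Indexing path segments with StandardAnalyzer also works in Lucene.Net 2.4.0. Do you know of an off-the-shelf Lucene analyzer that would split domain names at the dots, or separate file names from their extensions?

Maybe you could use a LetterTokenizer chained with some filters. LetterTokenizer splits text at non-letters.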
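To make the LetterTokenizer suggestion concrete, here is a minimal plain-Java sketch of the splitting behaviour it describes: break the text at every non-letter character. It does not call Lucene at all; `LetterSplitSketch` and `lettersOnlyTokens` are hypothetical names, and the lowercasing step is an assumption standing in for a lowercase filter chained after the tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this emulates "split at non-letters" and is
// not Lucene's implementation.
final class LetterSplitSketch {
    private LetterSplitSketch() {}

    /** Splits text at every non-letter character and lowercases each token. */
    static List<String> lettersOnlyTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Still inside a token: accumulate the lowercased letter.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the final token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lettersOnlyTokens("http://www.stackoverflow.com/questions/ask"));
        System.out.println(lettersOnlyTokens(
                "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt"));
    }
}
```

With this kind of splitting, www.stackoverflow.com becomes three tokens and detail.txt becomes two, which is the behaviour asked about in the follow-up. The trade-off is that digits also act as separators, so a name such as report2009.txt would lose its numeric part.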