Lucene.NET: Camel case tokenizer?


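I started playing with Lucene.NET today, and I wrote a simple test method to index and search source code files. The problem is that the standard analyzers/tokenizers treat an entire camel-case source code identifier as a single token.

I'm looking for a way to split camel-case identifiers such as MaxWidth into three tokens: MaxWidth, max and width. I've been looking for such a tokenizer, but I couldn't find one. Before writing my own: is there anything that already does this? Or is there a better approach than writing a tokenizer from scratch?

UPDATE: In the end, I decided to get my hands dirty and wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and update this question.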

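Solr has a tokenizer similar to what you need. Maybe you can translate its source code to C#. The link below may help with writing a custom tokenizer.

Here is my implementation: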

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    // The constructor name must match the class name, otherwise this does not compile.
    protected CamelCaseFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        // Rewrite the current term in place, e.g. "MaxWidth" becomes "Max Width".
        // Note that this yields one term containing spaces, not separate tokens.
        String splitString = splitCamelCase(_termAtt.toString());
        _termAtt.setEmpty();
        _termAtt.append(splitString);
        return true;
    }


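    // Inserts a space at every camel-case boundary.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",    // end of an acronym: "XMLDoc" -> "XML Doc"
                "(?<=[^A-Z])(?=[A-Z])",         // non-uppercase before uppercase: "maxWidth" -> "max Width"
                "(?<=[A-Za-z])(?=[^A-Za-z])"    // letter before non-letter: "width2" -> "width 2"
            ),
            " "
        );
    }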
}
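
The filter above rewrites MaxWidth into the single term "Max Width"; it does not emit the three tokens asked for in the question. As a rough sketch of that behaviour, here is the same regular expression used in plain Java, without any Lucene classes, to build the token list MaxWidth, max, width. The class name CamelCaseSplitDemo and the method tokensFor are made up for this illustration; turning it into a real TokenFilter that emits several tokens per input term would still require buffering the parts inside incrementToken, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene or Lucene.NET.
public class CamelCaseSplitDemo {

    // The same three zero-width boundaries used by splitCamelCase above.
    private static final String BOUNDARY =
        "(?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z])";

    // Returns the original identifier followed by its lower-cased parts.
    // An identifier with no camel-case boundary is returned on its own.
    static List<String> tokensFor(String identifier) {
        List<String> tokens = new ArrayList<>();
        tokens.add(identifier);
        String[] parts = identifier.split(BOUNDARY);
        if (parts.length > 1) {
            for (String part : parts) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokensFor("MaxWidth"));          // [MaxWidth, max, width]
        System.out.println(tokensFor("parseXMLDocument"));  // [parseXMLDocument, parse, xml, document]
    }
}
```

Keeping the original identifier as the first token means an exact search for MaxWidth can still match, while the lower-cased parts allow searches for max or width.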

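Yes, I had already noticed that, although it doesn't really do what I want. In the end I wrote my own CamelCaseTokenFilter. But I'll accept your answer.

Adir, this seems to work well. Here is the core of how I implemented it in Python: re.sub('(?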