Python: How do I write a custom analyzer in PyLucene via JCC / inheritance?

I want to write a custom analyzer with PyLucene. Normally in Java Lucene, when you write an analyzer class, your class inherits from Lucene's Analyzer class.

But PyLucene uses JCC, a Java-to-C++/Python compiler.

So how do you make a Python class inherit from a Java class through JCC, and in particular, how do you write a custom PyLucene analyzer?

Thanks

You can inherit from any class in PyLucene, but the classes whose names begin with Python additionally make the relevant methods "virtual": when Java code calls those methods, the calls are forwarded to your Python implementation. So for a custom analyzer, inherit from PythonAnalyzer and implement the tokenStream method.

Here is an example analyzer that wraps the EdgeNGram filter:

import lucene

class EdgeNGramAnalyzer(lucene.PythonAnalyzer):
    '''
    An example of a custom Analyzer (in this case an edge-n-gram analyzer).
    EdgeNGram analyzers are good for type-ahead.
    '''

    def __init__(self, side, minlength, maxlength):
        '''
        Args:
            side[enum]: one of lucene.EdgeNGramTokenFilter.Side.FRONT
                        or lucene.EdgeNGramTokenFilter.Side.BACK
            minlength[int]: minimum n-gram length
            maxlength[int]: maximum n-gram length
        '''
        lucene.PythonAnalyzer.__init__(self)
        self.side = side
        self.minlength = minlength
        self.maxlength = maxlength

    def tokenStream(self, fieldName, reader):
        result = lucene.LowerCaseTokenizer(lucene.Version.LUCENE_CURRENT, reader)
        result = lucene.StandardFilter(result)
        result = lucene.StopFilter(True, result,
                                   lucene.StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        result = lucene.ASCIIFoldingFilter(result)
        result = lucene.EdgeNGramTokenFilter(result, self.side,
                                             self.minlength, self.maxlength)
        return result
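
For completeness, a hedged usage sketch of how such an analyzer might be passed into indexing, using the same Lucene 3.x-era API as the example above (the index path and n-gram settings here are placeholder values, not from the original post):

import lucene

lucene.initVM()
store = lucene.SimpleFSDirectory(lucene.File("index"))  # placeholder path
analyzer = EdgeNGramAnalyzer(lucene.EdgeNGramTokenFilter.Side.FRONT, 1, 20)
# Old-style IndexWriter(directory, analyzer, create, maxFieldLength)
writer = lucene.IndexWriter(store, analyzer, True,
                            lucene.IndexWriter.MaxFieldLength.LIMITED)
# ... writer.addDocument(...) calls go here ...
writer.close()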
Here is another example, implementing a Porter-stemming analyzer:

# This sample illustrates how to write an Analyzer 'extension' in Python.
#
#   What is happening behind the scenes?
#
# The PorterStemmerAnalyzer python class does not in fact extend Analyzer;
# it merely provides an implementation for Analyzer's abstract tokenStream()
# method. When an instance of PorterStemmerAnalyzer is passed to PyLucene,
# for example with a call to IndexWriter(store, PorterStemmerAnalyzer(), True),
# the JCC-generated glue code wraps it in an instance of PythonAnalyzer,
# a proper Java extension of Analyzer which implements a native
# tokenStream() method whose job is to call the tokenStream() method on
# the Python instance it wraps. The PythonAnalyzer instance is the
# Analyzer extension bridge to PorterStemmerAnalyzer.

'''
More explanation...
Analyzers split a chunk of text up into tokens.
Analyzers are applied to an index globally (unless you use a perFieldAnalyzer).
Analyzers are built from Tokenizers and TokenFilters.
Tokenizers break strings up into tokens; TokenFilters turn Tokens into more
Tokens or filter Tokens out.
'''

import sys, os
from datetime import datetime
from lucene import *
from IndexFiles import IndexFiles


class PorterStemmerAnalyzer(PythonAnalyzer):

    def tokenStream(self, fieldName, reader):

        # There can only be one Tokenizer in each Analyzer
        result = StandardTokenizer(Version.LUCENE_CURRENT, reader)
        result = StandardFilter(result)
        result = LowerCaseFilter(result)
        result = PorterStemFilter(result)
        result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        return result


if __name__ == '__main__':
    if len(sys.argv) < 2:
        sys.exit("requires at least one argument: lucene-index-path")
    initVM()
    start = datetime.now()
    try:
        IndexFiles(sys.argv[1], "index", PorterStemmerAnalyzer())
        end = datetime.now()
        print(end - start)
    except Exception as e:
        print("Failed: ", e)


Comments:

- You're missing a word - "especially" what? Is there some "secret" to making this work? I tried inheriting from both PythonAnalyzer and ReusableAnalyzerBase, and each raises an invalid-args exception when the QueryParser instance is created.
- @Justin I posted a complete example of a custom analyzer being passed into index creation, and also added a perFieldAnalyzer snippet - hope this helps. I'm not sure whether the example I posted is too out of date to work with your version.
- @BenDeMott Your example is out of date, but Lucene 8.6.1 ships with a test_PerFieldAnalyzerWrapper.py file. Thanks for the pointers.

The perFieldAnalyzer snippet referenced above (older 3.x-era API):
analyzer = PerFieldAnalyzerWrapper(SimpleAnalyzer())
analyzer.addAnalyzer("partnum", KeywordAnalyzer())

query = QueryParser(Version.LUCENE_CURRENT, "description",
                    analyzer).parse("partnum:Q36 AND SPACE")
# self.searcher is an IndexSearcher created elsewhere
scoreDocs = self.searcher.search(query, 50).scoreDocs
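
Since the comments point out that the examples above are out of date: from Lucene 4 on, analyzers implement createComponents(fieldName) instead of tokenStream(fieldName, reader), and PyLucene's extension classes live in the org.apache.pylucene packages. Below is a hedged sketch of a modern equivalent, patterned after the test_PerFieldAnalyzerWrapper.py file mentioned above; exact imports can vary between PyLucene versions:

import lucene
from java.io import StringReader
from org.apache.pylucene.analysis import PythonAnalyzer
from org.apache.lucene.analysis import Analyzer, LowerCaseFilter
from org.apache.lucene.analysis.standard import StandardTokenizer
from org.apache.lucene.analysis.en import PorterStemFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

lucene.initVM()

class ModernPorterAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName):
        # Tokenizers no longer take the Reader in their constructor.
        source = StandardTokenizer()
        result = PorterStemFilter(LowerCaseFilter(source))
        return Analyzer.TokenStreamComponents(source, result)

    def initReader(self, fieldName, reader):
        return reader

# Usage: consume the stream to see the stemmed tokens.
stream = ModernPorterAnalyzer().tokenStream("f", StringReader("running runs"))
term = stream.addAttribute(CharTermAttribute.class_)
stream.reset()                 # mandatory since Lucene 4
while stream.incrementToken():
    print(term.toString())     # run, run
stream.close()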