How to flatten an object in Apache Solr and apply it to a fieldType

java indexing solr lucene

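I am trying to migrate my Lucene tokenizers to Apache Solr. I have already written a TokenizerFactory for every field type (title, body, and so on) in Lucene. In Lucene there is a way to add a field to a document directly. In Solr, to work with Lucene, we have to customize the tokenizers/filters. I have run into a problem with one field, and the many blogs and books I have read do not solve it. Most blogs and books use plain string or int values directly as the field type.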

I have built a custom TokenFilterFactory for Apache Solr and put it in schema.xml, as shown below:

<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="analyzer.TextWithMarkUpTokenizerFactory"/>
    <filter class="analyzer.ReverseFilterFactory"/>
  </analyzer>
</fieldType>

In the Apache Solr admin panel, the result looks like this:

{
    "id":"0.4470506508669744",
    "title":"com.xyz.data:[text = Several disparities are highlighted in the new report:\n\n74 percent of white male students said they felt like they belonged at school., tokens.size = 24], tokens = [Several] [disparities] [are] [highlighted] [in] [the] [new] [report] [:] [74] [percent] [of] [white] [male] [students] [said] [they] [felt] [like] [they] [belonged] [at] [school] [.] ",
    "_version_":1607597126134530048
}

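I cannot get hold of the textWithMarkUp instance in the custom TokenStream, which prevents me from flattening the given object the way I did with Lucene. In Lucene, I used to set the textWithMarkUp instance after creating the custom TokenStream instance. Below is the JSON version of my textWithMarkUp instance: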

{
"text": "The law, which was passed by the Louisiana Legislature and signed by Gov.",
"tokens": [
    {
        "category": "Determiner",
        "canonical": "The",
        "ids": null,
        "start": 0,
        "length": 3,
        "text": "The",
        "order": 0
    },
    //tokenized/stemmed/tagged all the words
],
"abbreviations": [],
"essentialTokenNumber": 12
}

The code below is what I am trying to do:

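import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TextWithMarkUpTokenizer extends Tokenizer {
    private final PositionIncrementAttribute posIncAtt;
    protected int tokenIndex = -1; // index of the current token in the collection of metaQTokens
    protected List<MetaQToken> metaQTokens;
    protected TokenStream tokenTokenizer;

    public TextWithMarkUpTokenizer() {
        MetaQTokenTokenizer metaQTokenizer = new MetaQTokenTokenizer();
        tokenTokenizer = metaQTokenizer;
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
    }

    // In plain Lucene this was called right after the tokenizer was constructed.
    // Assumes getTokens() returns the List<MetaQToken> stored in metaQTokens.
    public void setTextWithMarkUp(TextWithMarkUp text) {
        this.metaQTokens = text == null ? null : text.getTokens();
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // get instance of TextWithMarkUp here
        return false; // placeholder: nothing can be emitted until the instance is available
    }

    private void setCurrentToken(Token token) {
        ((IMetaQTokenAware) tokenTokenizer).setToken(token);
    }
}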

I have traced through the entire implementation of the TextWithMarkUpTokenizerFactory class, but once the jar is loaded from Solr's lib folder, Solr takes full control of the factory class.

So, is there a way to set the given instance at index time in Solr? I have already researched this. Is there any way it could solve my problem?

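Solr search results contain exactly what the indexing system received. That is the original input after all update processors have handled it. The update processor chain that Solr uses by default does not change the input.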

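The analysis chain defined in the schema has absolutely no effect on search results. It only affects the tokens generated at index time and at query time. Stored data is not affected by analysis.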

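When you call "addField" with a custom object, SolrJ most likely falls back to code that decides what to send to Solr from the input object (called val in the SolrJ source). That code creates a string in which the name of the class is followed by the string representation of the object. As MatsLindh said in a comment, SolrJ knows nothing about your custom object, so the data does not arrive at Solr as your custom object type.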
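To make that fallback concrete, here is a minimal sketch that imitates the described behaviour. It is not SolrJ's actual source: `MarkupExample`, `FallbackDemo` and `describe` are hypothetical names, and the `:` separator is inferred from the stored "title" value shown in the question.

```java
// Imitates the described fallback: an unknown object is reduced to a plain string.
final class FallbackDemo {
    // Class name, then ':' (assumed separator), then the object's toString().
    static String describe(Object val) {
        return val.getClass().getName() + ":" + val;
    }

    public static void main(String[] args) {
        // Prints the class name followed by the toString() text; no token data survives.
        System.out.println(describe(new MarkupExample("The law", 2)));
    }
}

// Hypothetical stand-in for the asker's TextWithMarkUp class.
final class MarkupExample {
    private final String text;
    private final int tokenCount;

    MarkupExample(String text, int tokenCount) {
        this.text = text;
        this.tokenCount = tokenCount;
    }

    @Override
    public String toString() {
        return "[text = " + text + ", tokens.size = " + tokenCount + "]";
    }
}
```

Whatever toString() leaves out is lost before the document reaches Solr, which is why the analysis chain never sees the original token list.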

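But you are submitting the string representation of the TextWithMarkup class, which is probably just the text part. Solr knows nothing about the "TextWithMarkup" class, and neither does the SolrJ client. You could try serializing the content to JSON and deserializing it on the other side in your field type, or submit the content that is given to the TextWithMarkup class and do the TextWithMarkup processing as part of your filter?

@MatsLindh Can you tell me more about how to deserialize it on the other side? I cannot retrieve the TextWithMarkup instance in the custom filter.

The best solution is to send the content used to create the TextWithMarkup instance to Solr and create the instance only there. The other option is to serialize it with JSON or the Java serializer and deserialize it on the other side. Is there a reason you cannot send the content and create the TextWithMarkup instance on the Solr side?

In my system, all NLP tasks are delegated to the ETL pipeline. Indexing starts once the ETL operation has finished. Can you tell me more about where in Solr I can deserialize the JSON?

I understand your answer. I will use JSON with addField in SolrJ, but I am quite confused about how to retrieve those values in a custom tokenizer or token filter class.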
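The serialize-then-deserialize idea from the comments can be sketched without any Solr or Lucene dependency. The example below uses the Java serializer plus Base64 (one of the two options mentioned; JSON would follow the same shape). `MarkupCodec` and `SimpleToken` are hypothetical names, and `SimpleToken` is a trimmed-down stand-in for the real token class. In a Lucene `Tokenizer` subclass, the `Reader` passed to `decode` would presumably be the tokenizer's `input` field.

```java
import java.io.*;
import java.util.*;

final class MarkupCodec {
    // Hypothetical, reduced token; the class in the question has more fields.
    static final class SimpleToken implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        final int start;
        final int length;

        SimpleToken(String text, int start, int length) {
            this.text = text;
            this.start = start;
            this.length = length;
        }
    }

    // Client side: turn the token list into a plain string usable as a field value.
    static String encode(List<SimpleToken> tokens) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(tokens));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Analysis side: only characters arrive, so read them all and rebuild the list.
    @SuppressWarnings("unchecked")
    static List<SimpleToken> decode(Reader input) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n; (n = input.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
            byte[] raw = Base64.getDecoder().decode(sb.toString().trim());
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
                return (List<SimpleToken>) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fieldValue = encode(Arrays.asList(
                new SimpleToken("The", 0, 3), new SimpleToken("law", 4, 3)));
        for (SimpleToken t : decode(new StringReader(fieldValue))) {
            System.out.println(t.text + " @" + t.start + "+" + t.length);
        }
    }
}
```

With this approach the encoded string is also what Solr stores and returns for the field, so the readable text may need a separate field. Java deserialization should only be applied to input from a trusted indexing client.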