SOLR 4.0 alphabetical sorting problem

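I have a problem with my SOLR address database that I am struggling to solve.

I built it from the example files; I am basically running the example configuration with a modified schema.

schema.xml:

I populate the database by pushing about 20,000 random test data sets (like the one below) to post.jar: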


1352498443_1
Anur
Lehnen
F
dummy-assets/female.jpg
Zugschaffner/in
Page 07
[lorem-ipsum filler text]
Lorem Lagna Epsum Emet
Erlenweg
82
76297
Lübeck
242
Germany
DE
anur.lehnen@lorem-lagna-epsum-emet.de
0392984823
0124111417
0325117132
0171459177

However, when I retrieve the data, alphabetical sorting does not seem to work properly. Consider the following query:

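{
  "responseHeader": {
    "status": 0,
    "QTime": 5,
    "params": {
      "sort": "surname_s asc",
      "fl": "surname_s",
      "indent": "true",
      "wt": "json",
      "q": "city:Berlin"
    }
  },
  "response": {
    "numFound": 1094,
    "start": 0,
    "docs": [{
      "surname_s": "Weil"
    }, {
      "surname_s": "Abel"
    }, {
      "surname_s": "Adam"
    }, {
      "surname_s": "Ade"
    }, {
      "surname_s": "Adrian"
    }, {
      "surname_s": "Aigner"
    }, {
      "surname_s": "Aigner"
    }, {
      "surname_s": "Alber"
    }, {
      "surname_s": "Alber"
    }, {
      "surname_s": "Albers"
    }]
  }
}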

Why does "Weil" appear at position 1 while the rest of the data seems to be sorted correctly?

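I think some of the additional analyzers applied in the text field type are causing this sorting behavior. In my experience, the best results when sorting strings come from the alphaOnlySort field type that ships with the example schema.xml, shown below: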
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- The PatternReplaceFilter gives you the flexibility to use
         Java Regular expression to replace any sequence of characters
         matching a pattern with an arbitrary replacement string, 
         which may include back references to portions of the original
         string matched by the pattern.

         See the Java Regular Expression documentation for more
         information on pattern and replacement string syntax.

         http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
      -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
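To make the effect of this analysis chain concrete, here is a minimal Python sketch that approximates what alphaOnlySort does to a single field value. It is only a model of the four steps in the field type above, not Solr code; the function name `alpha_only_sort_key` and the sample surnames are made up for illustration, and Python's `lower()` is assumed to be close enough to Lucene's lowercasing for these inputs.

```python
import re


def alpha_only_sort_key(value: str) -> str:
    """Approximate the alphaOnlySort analysis chain for one field value."""
    token = value          # KeywordTokenizer: the whole input stays one token
    token = token.lower()  # LowerCaseFilter: case-insensitive sorting
    token = token.strip()  # TrimFilter: drop leading/trailing whitespace
    # PatternReplaceFilter with pattern="([^a-z])", replace="all":
    # every character that is not a-z is removed.
    return re.sub(r"[^a-z]", "", token)


# Hypothetical sample values, not taken from the real index.
surnames = ["Weil", "Abel", "adam", " Ade "]
ordered = sorted(surnames, key=alpha_only_sort_key)
print(ordered)
```

Because the pattern keeps only a-z, digits, punctuation and non-ASCII letters such as umlauts are dropped from the sort key as well, which is worth keeping in mind for German surnames.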

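Note: the values do not need to be stored in the surname_s_sort field, hence the attribute stored="false", unless you want to display that field to the user.
You can then change your query to sort by surname_s_sort.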
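As a rough sketch of what the changed query could look like, the Python snippet below builds the select URL with the sort switched to the dedicated sort field. The host, port and core name in `SOLR_SELECT_URL` are assumptions about a local example setup, and `build_sorted_query` is a hypothetical helper; only the parameter names and the `q`, `fl`, `wt` and `indent` values come from the query shown in the question.

```python
from urllib.parse import urlencode

# Assumed location of a local example core; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_sorted_query(sort_field: str) -> str:
    """Build a select URL that sorts ascending on the given field."""
    params = {
        "q": "city:Berlin",
        "fl": "surname_s",            # still return the original field
        "sort": f"{sort_field} asc",  # sort on the dedicated sort field
        "wt": "json",
        "indent": "true",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_sorted_query("surname_s_sort")
print(url)
```

The field list still asks for surname_s, since the sort field is not stored and exists only to drive the ordering.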
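Sorting does not work on multivalued and tokenized fields:

Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field, provided that field is either non-tokenized (i.e. has no Analyzer) or uses an Analyzer that only produces a single Term (i.e. uses the KeywordTokenizer).
Use string as the field type and copy the source field (here surname_s) into a new field: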
<field name="surname_s_sort" type="string" indexed="true" stored="false"/>

<copyField source="surname_s" dest="surname_s_sort" />  



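As @Paige answered, you can use the KeywordTokenizer together with a lowercase filter, which does not tokenize the field.

I had a similar problem and tried alphaOnlySort. It works to some extent, but when the field contains characters such as -, / or spaces, the sort results become messy.
The result looked like this:
/abc
aa
/abc2
So I ended up using the lowercase field type. It was already present, so I assume it is a default type. I did use the copyField construct, so my final configuration is: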
<schema>
    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
    <fields>
       <field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
    </fields>
    <copyField source="job_name" dest="job_name_sort"/>
</schema>
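The difference from alphaOnlySort can be sketched in a few lines of Python. This is only a model of the lowercase field type above (KeywordTokenizer plus LowerCaseFilter), assuming plain code-point ordering of the resulting keys; `lowercase_sort_key` is a hypothetical name and the sample values are the ones from the messy result listed earlier.

```python
def lowercase_sort_key(value: str) -> str:
    """Approximate the lowercase field type: one token, lowercased."""
    # KeywordTokenizer keeps the entire value as a single token, and
    # LowerCaseFilter lowercases it. Punctuation and digits are preserved.
    return value.lower()


values = ["/abc2", "aa", "/abc"]
ordered = sorted(values, key=lowercase_sort_key)
print(ordered)
```

Since nothing is stripped, values that start with "/" stay grouped together instead of being interleaved with purely alphabetic values.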


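For anyone else having this problem, note that copyField is applied when documents are indexed, so existing documents have to be reindexed before the new sort field contains values.

Your assumption was exactly right. "weil" is a stop word for the GermanAnalyzer.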
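One plausible way to picture why a stop word ends up at position 1 is sketched below in Python. It assumes that a value whose only token is removed by the stop filter leaves the document without a sort term, and that documents without a sort term come first in an ascending sort (the behavior when neither sortMissingFirst nor sortMissingLast is set). The stop word set is a tiny illustrative subset, and both function names are hypothetical.

```python
from typing import List, Optional

# Tiny illustrative subset of German stop words; the real list is longer.
GERMAN_STOP_WORDS = {"weil", "und", "oder"}


def analyzed_sort_term(value: str) -> Optional[str]:
    """Model a tokenizing analyzer with stop word removal."""
    tokens = [t for t in value.lower().split() if t not in GERMAN_STOP_WORDS]
    # If every token was a stop word, nothing is left to sort on.
    return tokens[0] if tokens else None


def sort_ascending_missing_first(values: List[str]) -> List[str]:
    """Ascending sort in which values without a sort term come first."""
    def key(value: str):
        term = analyzed_sort_term(value)
        return (term is not None, term or "")
    return sorted(values, key=key)


ordered = sort_ascending_missing_first(["Abel", "Weil", "Adam", "Ade"])
print(ordered)
```

Sorting on a copy of the field that is not run through the stop filter, as in the configurations above, avoids the problem because "weil" then survives as a normal sort term.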