Solr not tokenizing at index time

solr

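I'm very new to Solr, so I assume I'm making some really obvious mistake.

I have configured a core to import some fields from a database table named db.podcast. I'm using DIH to do this. I'm pulling 4 fields from that table:

podcast_id, podcast_desc, podcast_name, podcast_keywords

This seems to go fine, and the data is added to the index.

However, when I check the term info for a field in the Schema Browser, it doesn't seem to have indexed the field correctly. Instead of breaking each podcast description into individual words, it gives me the full description.

For example, I was expecting a list like this: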

201 A
196 The
185 Then....

Instead, I get a list like this (I added dots to save space):

My schema.xml looks like this:

<types>
  <fieldtype name='text' class='solr.TextField' >
    <analyzer type="index" >
      <charFilter class="solr.HTMLStripCharFilterFactory" />
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords.txt" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldtype>
  <fieldtype name='int' class='solr.TrieIntField' />
</types>

<fields>
    <field name='podcast_id' indexed="true" stored="true" type='int' required="true"/>
    <field name='podcast_name' indexed="true" stored="true" type='text' />
    <field name='podcast_desc' indexed="true" stored="true" type='text' />
    <field name='podcast_keywords' indexed="true" stored="true" type='text' />
</fields>

<uniqueKey>podcast_id</uniqueKey>


My DIH document entity looks like this:

<entity name="podcast" query="select podcast_id, podcast_name, podcast_desc, podcast_keywords from db.podcast"
  deltaQuery="select podcast_id from db.podcast where last_modified > '${dataimporter.last_index_time}'"
  deltaImportQuery="select podcast_id, podcast_name, podcast_desc, podcast_keywords from db.podcast where podcast_id='dataimporter.delta.podcast_id'">      
</entity>

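Any idea what's going on? I think I read somewhere that this should only happen if I were using a "string" field type rather than "text".

Update 1:

Here is the log from the DIH update. It looks like it didn't process any documents. Any idea why that would be?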

<response>

<lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">465</int>
</lst>
<lst name="initArgs">
    <lst name="defaults">
        <str name="config">podcastDIHconfigfile.xml</str>
    </lst>
</lst>
<str name="command">delta-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
    <str name="Total Requests made to DataSource">22</str>
    <str name="Total Rows Fetched">21</str>
    <str name="Total Documents Processed">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Delta Dump started">2016-01-06 16:14:35</str>
    <str name="Identifying Delta">2016-01-06 16:14:35</str>
    <str name="Deltas Obtained">2016-01-06 16:14:36</str>
    <str name="Building documents">2016-01-06 16:14:36</str>
    <str name="Total Changed Documents">21</str>
    <str name="Time taken">0:0:0.368</str>
</lst>

</response>


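A hedged aside on the log above: it reports 21 changed documents but 0 processed, which would be consistent with the deltaImportQuery matching no rows. DIH only substitutes variables written as ${...}, and the deltaImportQuery posted above compares podcast_id with the literal text dataimporter.delta.podcast_id. A minimal sketch of the same entity with that placeholder written out follows. Whether this is the actual cause is an assumption; nothing in the thread confirms it.

```xml
<!-- Sketch only: same entity as posted, with the delta key written as a ${...} placeholder.
     That this explains "Total Documents Processed: 0" is an assumption, not confirmed in the thread. -->
<entity name="podcast"
  query="select podcast_id, podcast_name, podcast_desc, podcast_keywords from db.podcast"
  deltaQuery="select podcast_id from db.podcast where last_modified > '${dataimporter.last_index_time}'"
  deltaImportQuery="select podcast_id, podcast_name, podcast_desc, podcast_keywords from db.podcast where podcast_id='${dataimporter.delta.podcast_id}'">
</entity>
```

After a change like this, a full-import or another delta-import would be needed before the "Total Documents Processed" count could be compared again.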

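Where are you checking the field? Under "Field Value (Index)" or "Field Value (Query)"? Do you see any tokens?

Using the Schema Browser, I select a field and then load the term info. The top terms it gives are not tokens but the full text of the field. It seems no tokens are being created.

You don't check tokens with the Schema Browser. You can use the Analysis page, which should be at

yourSolrUrl:8983/solr/#/yourCoreName/Analysis

Try running a search term through the index analyzer on the Analysis page. If you see tokens, try the same text as a query on that page. What you see in the Schema Browser is not the actual content of the index but the raw text. Solr keeps both for different reasons. Is it the index you want to look at?

@TMBT I thought the Schema Browser lets you see the most frequent terms in the index? I thought you could use it to work out what to put in the stopwords file?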
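For checking tokens outside the admin UI, Solr also ships a field analysis request handler. Below is a minimal sketch of registering it in solrconfig.xml, assuming a Solr version where it is not already registered implicitly (recent versions expose it by default). The handler path is the conventional one rather than anything given in the thread.

```xml
<!-- Sketch: exposes field analysis over HTTP at /analysis/field.
     Only needed if the Solr version in use does not register it implicitly. -->
<requestHandler name="/analysis/field"
                class="solr.FieldAnalysisRequestHandler"
                startup="lazy" />
```

A request such as yourSolrUrl:8983/solr/yourCoreName/analysis/field?analysis.fieldname=podcast_desc&analysis.fieldvalue=The+quick+brown+fox should then return the tokens produced at each stage of the index analyzer for podcast_desc, which is the same information the Analysis page displays.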