Indexing Solr无法搜索nutch爬网条目,尽管字段被签名为index=true

Indexing Solr无法搜索nutch爬网条目,尽管字段被签名为index=true,indexing,solr,web-crawler,nutch,Indexing,Solr,Web Crawler,Nutch,我正在运行Nutch1.16爬虫实例和Solr版本8.3.0。我已经能够在本地目录中抓取文件,并编辑nutch site.xml,从运行bin/crawl-s url dircrawl 2>和dircrawl.log的文件中提取一些元数据(尽管没有我希望的那么多)。然后,爬网数据通过bin/nutch index dircrawl/crawdb/-linkdb dircrawl/linkdb/-dir dircrawl/segments/-filter-normalize发送到Solr,然后通过

我正在运行Nutch1.16爬虫实例和Solr版本8.3.0。我已经能够在本地目录中抓取文件,并编辑
nutch site.xml
,从运行
bin/crawl-s url dircrawl 2>和dircrawl.log的文件中提取一些元数据(尽管没有我希望的那么多)。然后,爬网数据通过
bin/nutch index dircrawl/crawdb/-linkdb dircrawl/linkdb/-dir dircrawl/segments/-filter-normalize
发送到Solr,然后通过标签存储和管理条目

现在,从UI运行Solr Admin,我正在尝试搜索数据。我确保所有我感兴趣的条目都以
index=true
的形式签名。但是,运行除
*:*
以外的任何搜索都将返回零结果。我尝试了所有可能的搜索字段组合,也没有骰子。我将链接到配置文件的描述,首先是solr,然后是nutch

schema.xml (becomes managed-schema when running it, for some reason)

<?xml version="1.0" encoding="UTF-8"?>
<schema name="nutch-crawler-indexing-config" version="1.6">
  <uniqueKey>id</uniqueKey>
  <fieldType name="_nest_path_" class="solr.NestPathField" omitTermFreqAndPositions="true" omitNorms="true" maxCharsForDocValues="-1" stored="false"/>
  <fieldType name="ancestor_path" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
  </fieldType>
  (all fieldTypes are the default ones)
  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2" maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_gl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_gl.txt" ignoreCase="true"/>
      <filter class="solr.GalicianStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.IndicNormalizationFilterFactory"/>
      <filter class="solr.HindiNormalizationFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_hi.txt" ignoreCase="true"/>
      <filter class="solr.HindiStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_hu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_hu.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_hy" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_hy.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Armenian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_id" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_id.txt" ignoreCase="true"/>
      <filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_it.txt" ignoreCase="true"/>
      <filter class="solr.ItalianLightStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ja" class="solr.TextField" autoGeneratePhraseQueries="false" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
      <filter class="solr.JapaneseBaseFormFilterFactory"/>
      <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ja.txt" ignoreCase="true"/>
      <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KoreanTokenizerFactory" outputUnknownUnigrams="false" decompoundMode="discard"/>
      <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
      <filter class="solr.KoreanReadingFormFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_lv" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_lv.txt" ignoreCase="true"/>
      <filter class="solr.LatvianStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_nl.txt" ignoreCase="true"/>
      <filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_no.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" ignoreCase="true"/>
      <filter class="solr.PortugueseLightStemFilterFactory"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ro" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_ro.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Romanian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_ru.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_sv.txt" ignoreCase="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Swedish"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ThaiTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_th.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.TurkishLowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_tr.txt" ignoreCase="false"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
    </analyzer>
  </fieldType>
  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
  </fieldType>
  <field name="_nest_path_" type="_nest_path_"/>
  <field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
  <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
  <field name="_version_" type="plong" indexed="false" stored="false"/>
  <field name="boost" type="pdoubles"/>
  <field name="content" type="text_general"/>
  <field name="digest" type="text_general"/>
  <field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
  <field name="metatag.author" type="text_general" indexed="true"/>
  <field name="metatag.channels" type="plongs"/>
  <field name="metatag.creator" type="text_general" indexed="true"/>
  <field name="metatag.samplerate" type="plongs"/>
  <field name="metatag.version" type="text_general"/>
  <field name="title" type="text_general" indexed="true"/>
  <field name="tstamp" type="pdates"/>
  <field name="url" type="text_general" stored="true"/>
  <dynamicField name="*_txt_en_split_tight" type="text_en_splitting_tight" indexed="true" stored="true"/>
  <dynamicField name="*_descendent_path" type="descendent_path" indexed="true" stored="true"/>
  <dynamicField name="*_ancestor_path" type="ancestor_path" indexed="true" stored="true"/>
  <dynamicField name="*_txt_en_split" type="text_en_splitting" indexed="true" stored="true"/>
  <dynamicField name="*_txt_sort" type="text_gen_sort" indexed="true" stored="true"/>
  <dynamicField name="ignored_*" type="ignored"/>
  <dynamicField name="*_txt_rev" type="text_general_rev" indexed="true" stored="true"/>
  <dynamicField name="*_phon_en" type="phonetic_en" indexed="true" stored="true"/>
  <dynamicField name="*_s_lower" type="lowercase" indexed="true" stored="true"/>
  <dynamicField name="*_txt_cjk" type="text_cjk" indexed="true" stored="true"/>
  <dynamicField name="random_*" type="random"/>
  <dynamicField name="*_t_sort" type="text_gen_sort" multiValued="false" indexed="true" stored="true"/>
  <dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ar" type="text_ar" indexed="true" stored="true"/>
  <dynamicField name="*_txt_bg" type="text_bg" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ca" type="text_ca" indexed="true" stored="true"/>
  <dynamicField name="*_txt_cz" type="text_cz" indexed="true" stored="true"/>
  <dynamicField name="*_txt_da" type="text_da" indexed="true" stored="true"/>
  <dynamicField name="*_txt_de" type="text_de" indexed="true" stored="true"/>
  <dynamicField name="*_txt_el" type="text_el" indexed="true" stored="true"/>
  <dynamicField name="*_txt_es" type="text_es" indexed="true" stored="true"/>
  <dynamicField name="*_txt_et" type="text_et" indexed="true" stored="true"/>
  <dynamicField name="*_txt_eu" type="text_eu" indexed="true" stored="true"/>
  <dynamicField name="*_txt_fa" type="text_fa" indexed="true" stored="true"/>
  <dynamicField name="*_txt_fi" type="text_fi" indexed="true" stored="true"/>
  <dynamicField name="*_txt_fr" type="text_fr" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ga" type="text_ga" indexed="true" stored="true"/>
  <dynamicField name="*_txt_gl" type="text_gl" indexed="true" stored="true"/>
  <dynamicField name="*_txt_hi" type="text_hi" indexed="true" stored="true"/>
  <dynamicField name="*_txt_hu" type="text_hu" indexed="true" stored="true"/>
  <dynamicField name="*_txt_hy" type="text_hy" indexed="true" stored="true"/>
  <dynamicField name="*_txt_id" type="text_id" indexed="true" stored="true"/>
  <dynamicField name="*_txt_it" type="text_it" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ja" type="text_ja" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ko" type="text_ko" indexed="true" stored="true"/>
  <dynamicField name="*_txt_lv" type="text_lv" indexed="true" stored="true"/>
  <dynamicField name="*_txt_nl" type="text_nl" indexed="true" stored="true"/>
  <dynamicField name="*_txt_no" type="text_no" indexed="true" stored="true"/>
  <dynamicField name="*_txt_pt" type="text_pt" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ro" type="text_ro" indexed="true" stored="true"/>
  <dynamicField name="*_txt_ru" type="text_ru" indexed="true" stored="true"/>
  <dynamicField name="*_txt_sv" type="text_sv" indexed="true" stored="true"/>
  <dynamicField name="*_txt_th" type="text_th" indexed="true" stored="true"/>
  <dynamicField name="*_txt_tr" type="text_tr" indexed="true" stored="true"/>
  <dynamicField name="*_point" type="point" indexed="true" stored="true"/>
  <dynamicField name="*_srpt" type="location_rpt" indexed="true" stored="true"/>
  <dynamicField name="attr_*" type="text_general" multiValued="true" indexed="true" stored="true"/>
  <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_str" type="strings" docValues="true" indexed="false" stored="false" useDocValuesAsStored="false"/>
  <dynamicField name="*_dts" type="pdate" multiValued="true" indexed="true" stored="true"/>
  <dynamicField name="*_dpf" type="delimited_payloads_float" indexed="true" stored="true"/>
  <dynamicField name="*_dpi" type="delimited_payloads_int" indexed="true" stored="true"/>
  <dynamicField name="*_dps" type="delimited_payloads_string" indexed="true" stored="true"/>
  <dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
  <dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
  <dynamicField name="*_ls" type="plongs" indexed="true" stored="true"/>
  <dynamicField name="*_bs" type="booleans" indexed="true" stored="true"/>
  <dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
  <dynamicField name="*_ds" type="pdoubles" indexed="true" stored="true"/>
  <dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/>
  <dynamicField name="*_ws" type="text_ws" indexed="true" stored="true"/>
  <dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_l" type="plong" indexed="true" stored="true"/>
  <dynamicField name="*_t" type="text_general" multiValued="false" indexed="true" stored="true"/>
  <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
  <dynamicField name="*_f" type="pfloat" indexed="true" stored="true"/>
  <dynamicField name="*_d" type="pdouble" indexed="true" stored="true"/>
  <dynamicField name="*_p" type="location" indexed="true" stored="true"/>
  <copyField source="digest" dest="digest_str" maxChars="256"/>
  <copyField source="title" dest="title_str" maxChars="256"/>
  <copyField source="url" dest="url_str" maxChars="256"/>
  <copyField source="content" dest="content_str" maxChars="256"/>
  <copyField source="metatag.author" dest="metatag.author_str" maxChars="256"/>
  <copyField source="metatag.version" dest="metatag.version_str" maxChars="256"/>
  <copyField source="metatag.creator" dest="metatag.creator_str" maxChars="256"/>
</schema>
运行任何其他类型查询的响应:

{
  "responseHeader":{
    ...
    "params":{
      "q":"Bumblebee",
      "_":"..."}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}
此外,我试图索引的数据是免费音乐档案中的各种.mp3文件

编辑:我试图搜索的文件如下所示:

  {
        "metatag.author":["A Kombi",
          "A Kombi"],
        "metatag.samplerate":[44100,
          44100],
        "title":["Plight Of The Bumblebee"],
        "url":["file:/c:/Users/.../fma/fma_small/009/009476.mp3"],
        "content":["Plight Of The Bumblebee\nPlight Of The Bumblebee\nA Kombi\nMusic to Drive By, track 2\n2004-09-14T00:00:00\nField Recordings\n30014.912\n"],
        "metatag.creator":["A Kombi",
          "A Kombi"],
        "tstamp":["2020-04-02T15:26:29.507Z"],
        "digest":["ddd4ab2288c5799a5646592e1a63437f"],
        "boost":[0.20851442],
        "id":"file:/c:/Users/.../fma/fma_small/009/009476.mp3",
        "metatag.version":["MPEG 3 Layer III Version 1",
          "MPEG 3 Layer III Version 1"],
        "metatag.channels":[2,
          2],
        "_version_":1662875102548590596}

您必须设置要搜索的字段-除非您选择。在较旧版本的schema.xml中,可以为schema配置此选项,但推荐的方法是在查询本身中配置它

但是,要支持自由文本搜索,最好使用
edismax
查询解析器,提供
defType=edismax
,然后通过
qf
(查询字段)参数设置要搜索的字段

q=Bumblebee&qf=title&defType=edismax
。。将在
标题
字段中搜索大黄蜂。您还可以为
qf
指定多个字段名,还可以调整为每个字段指定的权重:

qf=title^10 content
。。它将在
标题
内容
中搜索,并且与
内容
字段中的点击相比,
标题
字段中的任何点击的权重要高出十倍


fl
(字段列表)参数调整响应中返回的字段,如果您只需要可用字段的一小部分(例如仅id),则此参数非常有用为了避免更大的响应,并且必须为每个返回的文档从磁盘加载所有字段值。

在搜索
bumblebee
时,您希望返回什么文档?在使用Lucene查询解析器时,通常应该提供字段名,即
fieldname:bumblebee
。如果要使用纯文本搜索,请在查询中附加
defType=edismax&qf=fieldname\u to\u search\u,以使用edismax查询解析器。我正在尝试检索nutch爬网后索引的mp3文件的元数据描述。我已经更新了问题,以包括文件的外观。我已经向solr发送了一个POST命令,将“title”更改为索引。我确保该字段全部更新(我知道过账到solr会替换整个字段),但它仍然不允许我搜索“集合中的每个文档”以外的任何内容。
qf
是您要查询的字段。因此,在第一个示例中,您必须添加
qf=title
以在title字段中进行搜索。在另外两个示例中,您尝试搜索名为
Bumblebee
-的字段,而不进行查询
qf
代表“查询字段”-即要查询的字段。否,
fl
用于设置响应中返回的字段-而不是正在搜索的字段。原因可能是有一条规则可以将所有内容复制到后台的公共字段中,默认模式中的公共字段是
\u text
。然后,默认情况下,该字段也被配置为lucene查询解析器的默认搜索字段。但是,当您使用其他架构时,默认架构中的这些设置将不再存在。部分匹配将取决于您要查询的字段的字段类型。如果是
text\u general
,则只返回与单词匹配的内容。如果是
text\u ngram
或任何其他带有ngram的默认类型(如果您使用的是默认模式-这些内容在每个用例的模式中定义),您也会得到部分命中。
q=Bumblebee&qf=title&defType=edismax
qf=title^10 content