Solr cannot search Nutch-crawled entries even though the fields are marked indexed=true
I am running a Nutch 1.16 crawler instance together with Solr 8.3.0. I have been able to crawl files in a local directory and, by editing nutch-site.xml, extract some metadata (though not as much as I had hoped) from the files by running bin/crawl -s url dircrawl 2 >& dircrawl.log. The crawled data is then sent to Solr with bin/nutch index dircrawl/crawldb/ -linkdb dircrawl/linkdb/ -dir dircrawl/segments/ -filter -normalize, where the entries are stored and managed by tags.
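Now I am running Solr Admin from the UI and trying to search the data. I made sure that every entry I am interested in is marked indexed=true. However, any search other than *:* returns zero results. I have tried every combination of search fields I could think of, with no luck. The configuration files follow, Solr first and then Nutch.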
schema.xml (becomes managed-schema when running it, for some reason)
<?xml version="1.0" encoding="UTF-8"?>
<schema name="nutch-crawler-indexing-config" version="1.6">
<uniqueKey>id</uniqueKey>
<fieldType name="_nest_path_" class="solr.NestPathField" omitTermFreqAndPositions="true" omitNorms="true" maxCharsForDocValues="-1" stored="false"/>
<fieldType name="ancestor_path" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
</analyzer>
</fieldType>
(all fieldTypes are the default ones)
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" maxPosQuestion="2" maxFractionAsterisk="0.33" maxPosAsterisk="3" withOriginal="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_gl" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_gl.txt" ignoreCase="true"/>
<filter class="solr.GalicianStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.IndicNormalizationFilterFactory"/>
<filter class="solr.HindiNormalizationFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_hi.txt" ignoreCase="true"/>
<filter class="solr.HindiStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_hu" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_hu.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
</analyzer>
</fieldType>
<fieldType name="text_hy" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_hy.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Armenian"/>
</analyzer>
</fieldType>
<fieldType name="text_id" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_id.txt" ignoreCase="true"/>
<filter class="solr.IndonesianStemFilterFactory" stemDerivational="true"/>
</analyzer>
</fieldType>
<fieldType name="text_it" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" articles="lang/contractions_it.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_it.txt" ignoreCase="true"/>
<filter class="solr.ItalianLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ja" class="solr.TextField" autoGeneratePhraseQueries="false" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
<filter class="solr.JapaneseBaseFormFilterFactory"/>
<filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ja.txt" ignoreCase="true"/>
<filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.KoreanTokenizerFactory" outputUnknownUnigrams="false" decompoundMode="discard"/>
<filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
<filter class="solr.KoreanReadingFormFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_lv" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_lv.txt" ignoreCase="true"/>
<filter class="solr.LatvianStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_nl.txt" ignoreCase="true"/>
<filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
<filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
</analyzer>
</fieldType>
<fieldType name="text_no" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_no.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
</analyzer>
</fieldType>
<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_pt.txt" ignoreCase="true"/>
<filter class="solr.PortugueseLightStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_ro" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_ro.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Romanian"/>
</analyzer>
</fieldType>
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_ru.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>
</fieldType>
<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_sv.txt" ignoreCase="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="Swedish"/>
</analyzer>
</fieldType>
<fieldType name="text_th" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ThaiTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_th.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
<fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.TurkishLowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_tr.txt" ignoreCase="false"/>
<filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
</analyzer>
</fieldType>
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<field name="_nest_path_" type="_nest_path_"/>
<field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="boost" type="pdoubles"/>
<field name="content" type="text_general"/>
<field name="digest" type="text_general"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="metatag.author" type="text_general" indexed="true"/>
<field name="metatag.channels" type="plongs"/>
<field name="metatag.creator" type="text_general" indexed="true"/>
<field name="metatag.samplerate" type="plongs"/>
<field name="metatag.version" type="text_general"/>
<field name="title" type="text_general" indexed="true"/>
<field name="tstamp" type="pdates"/>
<field name="url" type="text_general" stored="true"/>
<dynamicField name="*_txt_en_split_tight" type="text_en_splitting_tight" indexed="true" stored="true"/>
<dynamicField name="*_descendent_path" type="descendent_path" indexed="true" stored="true"/>
<dynamicField name="*_ancestor_path" type="ancestor_path" indexed="true" stored="true"/>
<dynamicField name="*_txt_en_split" type="text_en_splitting" indexed="true" stored="true"/>
<dynamicField name="*_txt_sort" type="text_gen_sort" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored"/>
<dynamicField name="*_txt_rev" type="text_general_rev" indexed="true" stored="true"/>
<dynamicField name="*_phon_en" type="phonetic_en" indexed="true" stored="true"/>
<dynamicField name="*_s_lower" type="lowercase" indexed="true" stored="true"/>
<dynamicField name="*_txt_cjk" type="text_cjk" indexed="true" stored="true"/>
<dynamicField name="random_*" type="random"/>
<dynamicField name="*_t_sort" type="text_gen_sort" multiValued="false" indexed="true" stored="true"/>
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_txt_ar" type="text_ar" indexed="true" stored="true"/>
<dynamicField name="*_txt_bg" type="text_bg" indexed="true" stored="true"/>
<dynamicField name="*_txt_ca" type="text_ca" indexed="true" stored="true"/>
<dynamicField name="*_txt_cz" type="text_cz" indexed="true" stored="true"/>
<dynamicField name="*_txt_da" type="text_da" indexed="true" stored="true"/>
<dynamicField name="*_txt_de" type="text_de" indexed="true" stored="true"/>
<dynamicField name="*_txt_el" type="text_el" indexed="true" stored="true"/>
<dynamicField name="*_txt_es" type="text_es" indexed="true" stored="true"/>
<dynamicField name="*_txt_et" type="text_et" indexed="true" stored="true"/>
<dynamicField name="*_txt_eu" type="text_eu" indexed="true" stored="true"/>
<dynamicField name="*_txt_fa" type="text_fa" indexed="true" stored="true"/>
<dynamicField name="*_txt_fi" type="text_fi" indexed="true" stored="true"/>
<dynamicField name="*_txt_fr" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_txt_ga" type="text_ga" indexed="true" stored="true"/>
<dynamicField name="*_txt_gl" type="text_gl" indexed="true" stored="true"/>
<dynamicField name="*_txt_hi" type="text_hi" indexed="true" stored="true"/>
<dynamicField name="*_txt_hu" type="text_hu" indexed="true" stored="true"/>
<dynamicField name="*_txt_hy" type="text_hy" indexed="true" stored="true"/>
<dynamicField name="*_txt_id" type="text_id" indexed="true" stored="true"/>
<dynamicField name="*_txt_it" type="text_it" indexed="true" stored="true"/>
<dynamicField name="*_txt_ja" type="text_ja" indexed="true" stored="true"/>
<dynamicField name="*_txt_ko" type="text_ko" indexed="true" stored="true"/>
<dynamicField name="*_txt_lv" type="text_lv" indexed="true" stored="true"/>
<dynamicField name="*_txt_nl" type="text_nl" indexed="true" stored="true"/>
<dynamicField name="*_txt_no" type="text_no" indexed="true" stored="true"/>
<dynamicField name="*_txt_pt" type="text_pt" indexed="true" stored="true"/>
<dynamicField name="*_txt_ro" type="text_ro" indexed="true" stored="true"/>
<dynamicField name="*_txt_ru" type="text_ru" indexed="true" stored="true"/>
<dynamicField name="*_txt_sv" type="text_sv" indexed="true" stored="true"/>
<dynamicField name="*_txt_th" type="text_th" indexed="true" stored="true"/>
<dynamicField name="*_txt_tr" type="text_tr" indexed="true" stored="true"/>
<dynamicField name="*_point" type="point" indexed="true" stored="true"/>
<dynamicField name="*_srpt" type="location_rpt" indexed="true" stored="true"/>
<dynamicField name="attr_*" type="text_general" multiValued="true" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_str" type="strings" docValues="true" indexed="false" stored="false" useDocValuesAsStored="false"/>
<dynamicField name="*_dts" type="pdate" multiValued="true" indexed="true" stored="true"/>
<dynamicField name="*_dpf" type="delimited_payloads_float" indexed="true" stored="true"/>
<dynamicField name="*_dpi" type="delimited_payloads_int" indexed="true" stored="true"/>
<dynamicField name="*_dps" type="delimited_payloads_string" indexed="true" stored="true"/>
<dynamicField name="*_is" type="pints" indexed="true" stored="true"/>
<dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
<dynamicField name="*_ls" type="plongs" indexed="true" stored="true"/>
<dynamicField name="*_bs" type="booleans" indexed="true" stored="true"/>
<dynamicField name="*_fs" type="pfloats" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="pdoubles" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="pdate" indexed="true" stored="true"/>
<dynamicField name="*_ws" type="text_ws" indexed="true" stored="true"/>
<dynamicField name="*_i" type="pint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="plong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text_general" multiValued="false" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="pfloat" indexed="true" stored="true"/>
<dynamicField name="*_d" type="pdouble" indexed="true" stored="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>
<copyField source="digest" dest="digest_str" maxChars="256"/>
<copyField source="title" dest="title_str" maxChars="256"/>
<copyField source="url" dest="url_str" maxChars="256"/>
<copyField source="content" dest="content_str" maxChars="256"/>
<copyField source="metatag.author" dest="metatag.author_str" maxChars="256"/>
<copyField source="metatag.version" dest="metatag.version_str" maxChars="256"/>
<copyField source="metatag.creator" dest="metatag.creator_str" maxChars="256"/>
</schema>
Response when running any other kind of query:
{
"responseHeader":{
...
"params":{
"q":"Bumblebee",
"_":"..."}},
"response":{"numFound":0,"start":0,"docs":[]
}}
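Also, the data I am trying to index is a variety of .mp3 files from the Free Music Archive.
Edit: the documents I am trying to search look like this: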
{
"metatag.author":["A Kombi",
"A Kombi"],
"metatag.samplerate":[44100,
44100],
"title":["Plight Of The Bumblebee"],
"url":["file:/c:/Users/.../fma/fma_small/009/009476.mp3"],
"content":["Plight Of The Bumblebee\nPlight Of The Bumblebee\nA Kombi\nMusic to Drive By, track 2\n2004-09-14T00:00:00\nField Recordings\n30014.912\n"],
"metatag.creator":["A Kombi",
"A Kombi"],
"tstamp":["2020-04-02T15:26:29.507Z"],
"digest":["ddd4ab2288c5799a5646592e1a63437f"],
"boost":[0.20851442],
"id":"file:/c:/Users/.../fma/fma_small/009/009476.mp3",
"metatag.version":["MPEG 3 Layer III Version 1",
"MPEG 3 Layer III Version 1"],
"metatag.channels":[2,
2],
"_version_":1662875102548590596}
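You have to set which field you want to search, unless you have selected a default search field. In older versions of schema.xml this could be configured in the schema itself, but the recommended way is to configure it in the query.
However, to support free-text search, it is better to use the edismax query parser by supplying defType=edismax, and then set the fields to search through the qf (query fields) parameter:
q=Bumblebee&qf=title&defType=edismax
.. will search for Bumblebee in the title field. You can also give qf multiple field names and adjust the weight given to each field:
qf=title^10 content
.. will search in both title and content, and any hit in the title field is weighted ten times higher than a hit in the content field.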
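As a rough illustration of the parameters above, the following Python sketch assembles the request URL for an edismax query with weighted fields. The host, port, and core name (localhost:8983 and nutch) and the helper name build_edismax_url are assumptions made for the example, not details from the question; only q, defType, and qf come from the answer.

```python
from urllib.parse import urlencode

# Assumed Solr location and core name; replace with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/nutch/select"


def build_edismax_url(text, query_fields):
    """Return a select URL that searches `text` in the given fields.

    `query_fields` is a list such as ["title^10", "content"], where the
    optional ^N suffix is the per-field weight.
    """
    params = {
        "q": text,
        "defType": "edismax",          # switch from the default Lucene parser
        "qf": " ".join(query_fields),  # qf takes a space-separated field list
    }
    # urlencode percent-encodes "^" and turns the space into "+".
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_edismax_url("Bumblebee", ["title^10", "content"])
print(url)
```

Pasting the printed URL into a browser, or passing it to any HTTP client, should then run the same search that the Admin UI would run with these parameters filled in.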
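The fl (field list) parameter adjusts which fields are returned in the response. This is useful if you only need a small subset of the available fields (for example, only id), since it avoids larger responses and having to load every field value from disk for each returned document.

Which documents do you expect to be returned when searching for bumblebee? With the Lucene query parser you should usually provide a field name, i.e. fieldname:bumblebee. If you want plain-text search, append defType=edismax&qf=fieldname_to_search_in to the query to use the edismax query parser.

I am trying to retrieve the metadata descriptions of the mp3 files that were indexed after the Nutch crawl. I have updated the question to show what the documents look like.

I sent a POST command to Solr to change "title" to indexed. I made sure the field was fully updated (I know that posting to Solr replaces the whole field), but it still will not let me search for anything other than "every document in the collection".

qf is the field you want to query. So in the first example you have to add qf=title to search in the title field. In the other two examples you are trying to search a field named Bumblebee, without a query.

qf stands for "query fields", i.e. the fields to query. No, fl sets the fields returned in the response, not the fields being searched.

The reason is probably that there is a rule that copies everything into a common field behind the scenes; in the default schema that common field is _text_. By default, that field is also configured as the default search field for the Lucene query parser. When you use a different schema, these settings from the default schema are no longer present.

Partial matching depends on the field type of the field you are querying. If it is text_general, only whole-word matches are returned. If it is text_ngram or any other default type that uses ngrams (assuming you are using the default schema, where these types are defined for each use case), you will also get partial hits.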
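To make the distinction in the comments concrete, here is a small Python sketch, assuming the parameters are sent to Solr's standard select handler. It builds a fielded query for the default Lucene parser (field:term) and uses fl only to trim the response. The helper name lucene_field_params is hypothetical.

```python
from urllib.parse import urlencode


def lucene_field_params(field, term, return_fields=None):
    """Build the query string for a fielded Lucene-parser search.

    `field` is the field that is searched; `return_fields`, if given,
    only limits which stored fields come back in the response.
    """
    params = {"q": f"{field}:{term}"}
    if return_fields:
        # fl has no influence on which field is searched.
        params["fl"] = ",".join(return_fields)
    return urlencode(params)


# Search the title field, but return only id and title for each hit.
print(lucene_field_params("title", "Bumblebee", ["id", "title"]))
```

Here q decides where Solr looks, while fl decides what comes back; swapping the two roles is what produced the empty results discussed above.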