Plugins: Problem activating the Nutch headings plugin
I am trying to activate the headings plugin in Nutch 1.8, but for some reason it does not work. Here is the relevant part of my nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags|headings)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>activates metatag parsing </description>
</property>
<property>
<name>headings</name>
<value>h1;h2</value>
<description>Comma separated list of headings to retrieve from the document</description>
</property>
<property>
<name>headings.multivalued</name>
<value>false</value>
<description>Whether to support multivalued headings.</description>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.title, metatag.keywords, metatag.author,
metatag.author, headings.h1, headings.h2</value>
<description> Comma-separated list of keys to be taken from the parse metadata to generate fields. Can be used e.g. for 'description' or 'keywords' provided that these values are generated by a parser (see parse-metatags plugin)
</description>
</property>
Can anyone help?
Thanks, Chris
In <name>index.parse.md</name>, check for metatag.h1 and metatag.h2:
<property>
<name>index.parse.md</name>
<value>metatag.h1,metatag.h2</value>
...
By the way, headings is not a parse-... filter.
You have to use:
<name>plugin.includes</name>
<value>headings|parse-(html|tika|metatags)|...
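Why the placement inside plugin.includes matters can be checked with a small sketch. It assumes that the plugin id is headings (as this answer implies) and that Nutch accepts a plugin only when its whole id matches the plugin.includes regular expression. The helper name is_included is hypothetical, and the second pattern contains only the visible part of the value above, since the rest is elided.

```python
import re

# plugin.includes value from the question: "headings" sits inside the parse-(...) group.
QUESTION_VALUE = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|metatags|headings)"
    r"|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)

# Visible part of the corrected value: "headings" is its own alternative.
# The remainder is elided ("...") in the answer and is not reconstructed here.
ANSWER_VALUE_PREFIX = r"headings|parse-(html|tika|metatags)"


def is_included(plugin_includes: str, plugin_id: str) -> bool:
    """Return True if the entire plugin id matches the plugin.includes regex.

    Assumption: whole-id matching, which this sketch models with re.fullmatch.
    """
    return re.fullmatch(plugin_includes, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("headings", "parse-headings", "parse-html"):
        print(
            plugin_id,
            is_included(QUESTION_VALUE, plugin_id),
            is_included(ANSWER_VALUE_PREFIX, plugin_id),
        )
```

Under these assumptions, the question's value only matches a plugin called parse-headings, which is why the headings plugin is never activated until it is listed as a separate alternative.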
Now it should work.
After trying it out myself, I found that the following should work with Apache Nutch 1.9:
<property>
<name>plugin.includes</name>
<value>protocol-http|headings|parse-(html|tika|metatags)|...</value>
</property>
<property>
<name>index.parse.md</name>
<value>h1,h2,h3</value>
</property>
<property>
<name>headings</name>
<value>h1,h2,h3</value>
</property>
<property>
<name>headings.multivalued</name>
<value>true</value>
</property>
When using Apache Solr, the following should be added to the schema.xml file:
<!-- fields for the headings plugin -->
<field name="h1" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="h2" type="text" stored="true" indexed="true" multiValued="true"/>
<field name="h3" type="text" stored="true" indexed="true" multiValued="true"/>