Getting the text of an HTML file from the Tika parser in Solr
I am trying to use Solr 6 to index HTML files automatically. The solrconfig.xml file looks like this:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
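Is the meta tag with name="stream_size" a tag that Solr interprets as the field stream_size, with the content of that tag taken as the value? And why is the text of the HTML not in any such tag?
The example techproducts configset includes
'a copyField directive that causes everything to be indexed in a predefined "catch-all" text field, to enable single-field search that includes the content of all fields.'
Perhaps you could compare your configuration with the techproducts one to understand this better. Otherwise, you will need to show more of your configuration.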
Yes, you will evidently get a Solr field named stream_size with the value 869. But as things stand, you have "extract_only" set, which only parses the file and does not index it.
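The difference between parsing only and indexing comes down to the request parameters. Here is a minimal Python sketch of the two request URLs, assuming the "extract_only" option mentioned above maps to Solr Cell's extractOnly request parameter. The core URL, the document id, and the function name are hypothetical placeholders, not taken from the question.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, extract_only=False):
    """Build a URL for Solr's /update/extract handler.

    base_url and doc_id are hypothetical values used for illustration.
    """
    if extract_only:
        # Tika parses the file and Solr returns the result; nothing is indexed.
        params = {"extractOnly": "true"}
    else:
        # Index the document under a unique id and commit so it is searchable.
        params = {"literal.id": doc_id, "commit": "true"}
    return f"{base_url}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core name "mycore" on a local Solr instance.
    print(build_extract_url("http://localhost:8983/solr/mycore", "doc1"))
    print(build_extract_url("http://localhost:8983/solr/mycore", "doc1", extract_only=True))
```

The file itself would still be sent as the body of a POST to the resulting URL. With the handler defaults shown in the question, the extracted body text would then be mapped to the _text_ field by fmap.content.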
{
'responseHeader'=>{
'status'=>0,
'QTime'=>12},
''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta
name="stream_size" content="869"/>
<meta name="X-Parsed-By"
content="org.apache.tika.parser.DefaultParser"/>
<meta
name="X-Parsed-By"
content="org.apache.tika.parser.html.HtmlParser"/>
<meta
name="stream_content_type" content="text/html"/>
<meta name="dc:title"
content="System Requirements"/>
<meta
name="Content-Encoding" content="UTF-8"/>
<meta name="Content-Type-Hint"
content="text/html; charset=UTF-8"/>
<meta
name="resourceName"
content="/home/szr163/search441/indexer/solr-6.3.0/docs/SYSTEM_REQUIREMENTS.html"/>
<meta
name="Content-Type"
content="text/html; charset=UTF-8"/>
<title>System Requirements</title>
</head>
<body>
<h1>System Requirements</h1>
<p>Apache Solr runs on Java 8 or greater.</p>
<p>It is also recommended to always use the latest update version of your Java VM, because bugs may affect Solr. An overview of known JVM bugs can be found on <a
shape="rect" href="http://wiki.apache.org/lucene-java/JavaBugs">http://wiki.apache.org/lucene-java/JavaBugs</a>
</p>
<p>With all Java versions it is strongly recommended to not use experimental <code>-XX</code> JVM options.</p>
<p>CPU, disk and memory requirements are based on the many choices made in implementing Solr (document size, number of documents, and number of hits retrieved to name a few). The benchmarks page has some information related to performance on particular platforms. </p>
</body>
</html>
',
'null_metadata'=>[
'stream_size',['869'],
'X-Parsed-By',['org.apache.tika.parser.DefaultParser',
'org.apache.tika.parser.html.HtmlParser'],
'stream_content_type',['text/html'],
'dc:title',['System Requirements'],
'Content-Encoding',['UTF-8'],
'Content-Type-Hint',['text/html; charset=UTF-8'],
'resourceName',['/home/szr163/search441/indexer/solr-6.3.0/docs/SYSTEM_REQUIREMENTS.html'],
'title',['System Requirements'],
'Content-Type',['text/html; charset=UTF-8']]}
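In the response above, the text of the HTML file sits inside the body element of the XHTML that Tika produced, while the meta elements in the head only carry metadata. If the goal is just the plain text, one possible approach is to parse that XHTML string on the client side. The sketch below assumes the XHTML has already been pulled out of the response as a string; the function name body_text and the shortened SAMPLE document are illustrative only.

```python
import xml.etree.ElementTree as ET

# Tika emits XHTML, so every element lives in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_text(xhtml):
    """Return the whitespace-normalised text found inside <body>."""
    # Parse from bytes so the encoding declaration in the prolog is honoured.
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        return ""
    # itertext() walks all nested text nodes; split/join collapses whitespace.
    return " ".join(" ".join(body.itertext()).split())


# Shortened, illustrative version of the XHTML in the response above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>System Requirements</title></head>
<body>
<h1>System Requirements</h1>
<p>Apache Solr runs on Java 8 or greater.</p>
</body>
</html>"""

if __name__ == "__main__":
    print(body_text(SAMPLE))
```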