Solr Tika fetches binary content stored in a database but does not index it_Solr_Binaryfiles_Apache Tika


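I am trying to parse binary content stored in a database, specifically in the column file_data of the table document_attachment, and index it so that it can be searched with Solr. When I run the indexer, it fetches twice as many rows as the query in the entity named "dcs" returns, and it does not throw any error or exception. However, it does not index the binary content, even though the field is fetched from the table and I have associated that field with Tika.

I am using apache-solr-3.6.1 and Tika 1.0.

My configuration files look like this: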

data-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <!-- JDBC source for the parent query; convertType="false" leaves column types untouched -->
  <dataSource
          driver="com.mysql.jdbc.Driver"
          url="jdbc:mysql://localhost/espritkm_1?zeroDateTimeBehavior=convertToNull"
          user="root"
          password=""
          autoCommit="true" batchSize="-1"
          convertType="false"
          name="test"
          />

  <!-- Streams the binary column of the parent row to the Tika entity -->
  <dataSource name="fieldReader" type="FieldStreamDataSource" />
  <document name="items">
    <entity name="dcs"
            query="SELECT 222000000000000000+d.id AS common_id_attr,d.id AS id,UNIX_TIMESTAMP(d.created_at) AS date_added,d.file_name as common1, d.description as common2, d.file_mime_type as common3, 72 as common4,(Select group_concat(trim(tags) ORDER BY trim(tags) SEPARATOR ' | ') from tags t where t.type_id = 72 AND t.feature_id = d.id group by t.feature_id) as common5,d.created_by as common6, df.name as common7,CONCAT(d.file_name,'.',d.file_mime_type) as common8,'' as common9,(Select da.file_data from document_attachment da where da.document_id = d.id) as text FROM document d LEFT JOIN document_folder df ON df.id = d.document_folder_id WHERE d.is_deleted = 0 and d.parent_id = 0 " dataSource="test" transformer="TemplateTransformer">

      <field column="common_id_attr" name="common_id_attr" />
      <field column="id" name="id" />
      <!-- Nested entity: Tika reads the bytes from dcs.text and emits the extracted body as column "text" -->
      <entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="dcs.text" format="text" pk="dcs.id">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>

schema.xml

<schema>
  <fields>
    <field name="common_id_attr" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="true" multiValued="true"/>
  </fields>
  <uniqueKey>common_id_attr</uniqueKey>
  <solrQueryParser defaultOperator="OR"/>
  <defaultSearchField>text</defaultSearchField>
</schema>



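Although it fetches twice as many rows, apparently counting the rows from Tika as separate documents (I don't understand why), it does not index the binary content.

I have been stuck on this problem for a long time. Can anyone help?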

I was able to index documents using Apache Solr 3.6.2. I described the steps here:

I think it should work in 3.6.1 as well. I was simply too impatient to hunt down the 3.6.1 tarball, and the official site only offers 3.6.2.
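
As a rough sketch, and not necessarily what the steps referenced above contain, this is the kind of solrconfig.xml wiring a DataImportHandler plus Tika setup generally needs in Solr 3.x. The lib paths are assumptions that depend on where the Solr home sits relative to the distribution. TikaEntityProcessor ships in the dataimporthandler-extras jar, which the first regex is meant to pick up along with the core DIH jar. The MySQL JDBC driver jar also has to be on the classpath.

```xml
<!-- solrconfig.xml (sketch): the dir values are assumptions, adjust them to the actual install layout -->
<!-- Loads both apache-solr-dataimporthandler and apache-solr-dataimporthandler-extras -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<!-- Tika and its parser dependencies live with the extraction contrib -->
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />

<!-- Registers the import handler and points it at the data-config.xml from the question -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```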

I hope this helps.

After the DB import completes, what is in the field text? No values at all?
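
If the field turns out to be empty or to contain junk, one thing that may be worth trying is to stop the raw BLOB column from sharing its name with the schema field text, since DIH also maps columns implicitly when their names match schema fields. The following is a hypothetical variant of the entities from the question, not a confirmed fix: the alias file_blob and the entity name attachment are made up for illustration, and the query is trimmed to the relevant columns for brevity.

```xml
<!-- Hypothetical variant of the entities in data-config.xml; not a confirmed fix -->
<entity name="dcs" dataSource="test"
        query="SELECT 222000000000000000+d.id AS common_id_attr,
                      d.id AS id,
                      (SELECT da.file_data FROM document_attachment da
                        WHERE da.document_id = d.id) AS file_blob
               FROM document d
               WHERE d.is_deleted = 0 AND d.parent_id = 0">
  <field column="common_id_attr" name="common_id_attr" />
  <field column="id" name="id" />
  <!-- file_blob is an assumed alias, chosen so the raw bytes do not match the schema field "text" -->
  <entity name="attachment" dataSource="fieldReader"
          processor="TikaEntityProcessor"
          dataField="dcs.file_blob" format="text">
    <!-- "text" is the column in which TikaEntityProcessor returns the extracted body -->
    <field column="text" name="text" />
  </entity>
</entity>
```

As for the doubled row count, it is probably not a sign of duplicate documents: as far as I know, the DIH counter for fetched rows includes the rows of nested entities, so one Tika row per parent row would double the total.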