Lucene 在Solr中索引PDF文档,无唯一键
我想索引PDF(和其他丰富的)文档。我正在使用DataImportHandler 下面是my schema.xml的外观:Lucene 在Solr中索引PDF文档,无唯一键,lucene,solr,Lucene,Solr,我想索引PDF(和其他丰富的)文档。我正在使用DataImportHandler 下面是my schema.xml的外观: ......... ......... <field name="title" type="text" indexed="true" stored="true" multiValued="false"/> <field name="description" type="text" indexed="true" stored="true" multi
.........
.........
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
........
........
<uniqueKey>link</uniqueKey>
我是否需要为pdf文档创建一个名为link的字段
之前已经有人问过这个问题,但提供的解决方案使用ExtractRequestHandler,但我想通过DataImportHandler使用它 试试这个:
<entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
<field column="path" name="link"/>
<entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
<field column="title" name="title" meta="true"/>
<field column="Creation-Date" name="date_published" meta="true"/>
</entity>
</entity>
试试这个:
<entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
<field column="path" name="link"/>
<entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
<field column="title" name="title" meta="true"/>
<field column="Creation-Date" name="date_published" meta="true"/>
</entity>
</entity>
<entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
<field column="path" name="link"/>
<entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
<field column="title" name="title" meta="true"/>
<field column="Creation-Date" name="date_published" meta="true"/>
</entity>
</entity>