Solr: How do I index an NFS mount using Nutch?

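I am trying to build a search tool, hosted on a CentOS 7 machine, that should index and search directories from mounted NFS exports. Nutch + Solr looked like the best option for this. I am having a hard time configuring the URLs for it, because the crawl does not involve any http locations.

The mount is at /mnt.

So my seed.txt looks like this: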

[root@sauron bin]# cat /root/Desktop/apache-nutch-1.13/urls/seed.txt
file:///mnt

My regex-urlfilter.txt allows the same location and the file protocol:

# skip http:, https:, ftp: and mailto: urls
-^(http|https|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^file:///mnt

However, when I try to bootstrap from the initial seed list, nothing is injected:

[root@sauron apache-nutch-1.13]# bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-06-12 00:07:49
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2017-06-12 00:10:27, elapsed: 00:02:38

I have also tried changing seed.txt to each of the following, without success:

file:/mnt
file:////<IP>:<export_path>

Please let me know if I am doing something wrong here.

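From a URI point of view, a file system is not that different for Nutch. You just need to have the protocol-file plugin enabled and configure regex-urlfilter.txt as follows: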

+^file:///mnt/directory/
-.

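With these rules, Nutch will not index the parent directories of the specified directory.

Keep in mind that because the NFS share is already mounted locally, it behaves like a normal local file system. More information can be found in the linked documentation.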
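The effect of the two rules above can be checked offline with a small sketch. It assumes the semantics described in the header of the stock regex-urlfilter.txt: rules are tried from top to bottom, the first pattern found in the URL decides, and a URL that matches no rule is rejected. The helpers `parse_rules` and `accepts` are hypothetical names, not part of Nutch, and Python's `re` module stands in for Java's regex engine.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter.txt style lines into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """ASSUMPTION: first matching rule wins; no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules("""
+^file:///mnt/directory/
-.
""")

for url in (
    "file:///mnt/directory/report.txt",  # hypothetical file under the target
    "file:///mnt/",                      # parent directory
    "file:///etc/passwd",                # unrelated path
):
    print(url, "->", "accepted" if accepts(RULES, url) else "rejected")
```

Only the first URL is accepted; the parent directory and the unrelated path fall through to the catch-all `-.` rule.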

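Thank you very much. I have been following the same document you pointed to, and I have made the changes you suggested. My nutch-site.xml already had the protocol-file plugin:

[root@sauron apache-nutch-1.13]# cat /root/Desktop/apache-nutch-1.13/conf/nutch-site.xml | grep value
<value>My nutch Spider</value>
<value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

I have also added the line that prevents crawling of parent directories, but the URL is still rejected.

If you check the logs, especially

Injector: Total urls rejected by filters: 1

this means that some URL filter is blocking your URL. Could you remove or comment out this line

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

and try again? Otherwise, move your rule to the top of the file so that a blocking rule is not hit first.

Please try changing the URL filter rule to

+^file:/mnt/directory/

(only one slash after file:); see the linked reference. I will update the tutorial to reflect this gory detail.

@SebastianNagel That solved my problem:

# bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-06-14 12:09:41
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 0
Injector: Total urls injected after normalization and filtering: 1
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 1
Injector: finished at 2017-06-14 12:12:22, elapsed: 00:02:40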
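The following sketch illustrates why the single-slash rule helps. It rests on an assumption that the thread implies but does not spell out: by the time the regex filter sees the seed, it has the form `file:/mnt` rather than `file:///mnt`. The rule list is the asker's active rules from regex-urlfilter.txt; `accepts` is a hypothetical helper that mimics first-match-wins filtering, with Python's `re` standing in for Java's regex engine.

```python
import re

# The asker's active blocking rules, in file order (comments dropped).
BASE_RULES = [
    r"-^(http|https|ftp|mailto):",
    r"-[?*!@=]",
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",
]


def accepts(rule_lines, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for line in rule_lines:
        if re.search(line[1:], url):
            return line[0] == "+"
    return False


# ASSUMPTION: the form of the seed URL when the filter is applied.
seed_as_filtered = "file:/mnt"

three_slashes = BASE_RULES + [r"+^file:///mnt"]
one_slash = BASE_RULES + [r"+^file:/mnt"]

print(accepts(three_slashes, seed_as_filtered))  # False: no rule matches
print(accepts(one_slash, seed_as_filtered))      # True: the accept rule matches

# The loop-breaking rule does not match the seed in either form,
# so removing it alone would not change the outcome.
print(re.search(BASE_RULES[2][1:], "file:///mnt"))  # None
print(re.search(BASE_RULES[2][1:], "file:/mnt"))    # None
```

Under that assumption, the seed was rejected because no accept rule matched it, not because one of the blocking rules did, which is consistent with the fix reported above.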
