How to make Nutch crawl files and subfolders - it only crawls the index of the folder


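Edit: I found my answer and wrote it up below, but awarded the bounty to Taha, as he provided some good suggestions.

I am setting up Nutch to crawl a local folder (a Samba mount). I followed the tutorial.

My folder looks like this: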

nutch@ubuntu:~$ ls /mnt/ntserver/
expansion.docx  test-folder  test-shared.txt

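with some further files and folders under test-folder.

When I run Nutch, it does not index the files or the subfolders. It puts only a single document into Solr, which is the index of the folder. This is what I get in Solr after running Nutch against an empty Solr index: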

"response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "content": [
          "Index of /mnt/ntserver Index of /mnt/ntserver ../ - - - expansion.docx Mon, 30 Dec 2013 14:00:42 GMT 70524 test-folder/ Fri, 17 Jan 2014 09:38:50 GMT - test-shared.txt Thu, 16 Jan 2014 11:33:42 GMT 16"
        ],
      .....

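How do I make Nutch index the files and subfolders?

Edit: if I set regex-urlfilter to accept everything (after filtering out GIFs, http, etc.), like this

+.

then Nutch seems to move up the folder hierarchy but not down, and it still crawls only the indexes, not the files. This is what I get in Solr: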

"response": {
    "numFound": 26,
    "start": 0,
    "docs": [
      {
        "title": [
          "Index of /"
        ]
      },
      {
        "title": [
          "Index of /bin"
        ]
      },
      ...
      {
        "title": [
          "Index of /mnt"
        ]
      },
      {
        "title": [
          "Index of /mnt/ntserver"
        ]
      },
      ...
    ]

Additional information:

This is the crawl command I use:

apache-nutch-1.7/bin/nutch crawl -dir fileCrawl -urls apache-nutch-1.7/urls/ -solr http://localhost:8983/solr -depth 3 -topN 10000

This is the content of my seed URL file:

nutch@ubuntu:~$ cat apache-nutch-1.7/urls/urls_to_be_crawled.txt 
file:////mnt/ntserver

This is my regex-urlfilter.txt:

nutch@ubuntu:~$ cat apache-nutch-1.7/conf/regex-urlfilter.txt
# skip http: ftp: and mailto: urls
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY|cs|CS|dll|DLL|refresh|REFRESH)$

# accept any files
+.*mnt/ntserver.*

In nutch-site.xml I have included protocol-file and set no limit on the file size:

nutch@ubuntu:~$ cat apache-nutch-1.7/conf/nutch-site.xml
...
<property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|index-more<!--|remove-empty-document|title-adder--></value>
    <description></description>
</property>

<property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

...

I have commented out the rule that removes duplicate slashes in regex-normalize.xml:

nutch@ubuntu:~$ cat apache-nutch-1.7/conf/regex-normalize.xml
...
<!-- removes duplicate slashes - commented out, so we won't get invalid filenames 
<regex>
    <pattern>(?&lt;!:)/{2,}</pattern>
    <substitution>/</substitution>
</regex>
-->
...

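While looking into the source of File and FileResponse, I found the following:

There is a configuration parameter named "file.crawl.parent" which controls whether Nutch should also crawl the parent of a directory. By default it is true.

In this implementation, when Nutch encounters a directory, it generates the list of files in that directory as a set of hyperlinks in the content; otherwise it reads the file content. Nutch uses File.isDirectory() to determine whether a given path is a directory. So check whether your path is really being interpreted as a directory.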
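To make the directory-versus-file branch concrete, here is a rough Python sketch of the behaviour described above. It is not Nutch's code: fetch_local is a made-up name, os.path.isdir stands in for Java's File.isDirectory(), and the HTML layout is only an approximation of the "Index of ..." page seen in the question.

```python
import html
import os
import tempfile


def fetch_local(path):
    """Return (kind, content): an HTML link listing for a directory, raw bytes for a file."""
    if os.path.isdir(path):  # Python analogue of Java's File.isDirectory()
        links = []
        for name in sorted(os.listdir(path)):
            # Mark sub-directories with a trailing slash, as in the listing from the question.
            if os.path.isdir(os.path.join(path, name)):
                name += "/"
            links.append('<a href="{0}">{0}</a>'.format(html.escape(name)))
        body = "<h1>Index of {}</h1>\n{}".format(html.escape(path), "\n".join(links))
        return "directory", body
    # Anything that is not a directory is read as plain file content.
    with open(path, "rb") as handle:
        return "file", handle.read()


if __name__ == "__main__":
    # Build a throwaway folder that loosely mirrors the one in the question.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "test-folder"))
        with open(os.path.join(root, "test-shared.txt"), "w") as handle:
            handle.write("hello")
        print(fetch_local(root)[1])
        print(fetch_local(os.path.join(root, "test-shared.txt")))
```

The point of the sketch is that the files themselves are only reached through the links in the listing, so everything depends on those links being resolved to the right paths.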

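I found that to crawl the local file system you have to add a trailing slash to the seed URL; otherwise Nutch does not identify the last part of the path as a directory.

So I changed my seed URL from

file:////mnt/ntserver

to

file:////mnt/ntserver/

and then things worked.

More details: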

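For example, suppose I have a file test.txt under /mnt/ntserver and use file:////mnt/ntserver as my seed URL. Nutch correctly parses the index of /mnt/ntserver and discovers that there is a file named test.txt, but it then tries to fetch the file /mnt/test.txt. After adding the trailing slash to the seed URL, making it file:////mnt/ntserver/, Nutch tries to fetch /mnt/ntserver/test.txt instead, which solved my problem.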
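The same effect can be reproduced with ordinary relative-URL resolution. The sketch below uses Python's urljoin as a stand-in for whatever Nutch does internally (an assumption on my part), and it uses the three-slash form file:/// rather than the four slashes from the seed file to keep the parsing unambiguous. resolve_outlink is just an illustrative name.

```python
from urllib.parse import urljoin, urlparse


def resolve_outlink(base_url, href):
    """Resolve a relative link from a directory listing against the URL it was found on."""
    return urljoin(base_url, href)


if __name__ == "__main__":
    # Without the trailing slash, the last path segment is treated as a file name
    # and is replaced by the link; with the slash it is kept as a directory.
    for seed in ("file:///mnt/ntserver", "file:///mnt/ntserver/"):
        target = resolve_outlink(seed, "test.txt")
        print(seed, "->", urlparse(target).path)
```

Without the trailing slash the resolved path drops the ntserver segment, which matches the /mnt/test.txt fetch described above.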
By the way, to stop Nutch from moving up the folder tree to the root, I set file.crawl.parent to false in nutch-default.xml, but this could also be done through regex-urlfilter.txt.
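For the filter-based alternative, the relevant detail is that the regex URL filter applies the first rule that matches and drops any URL that matches no rule. Here is a small Python sketch of that first-match logic. The second rule is hypothetical (it is not from my actual config), and the /+ is there so that the number of slashes after file: does not matter.

```python
import re

# Hypothetical rule set (an assumption, not taken from the config shown above):
# reject non-file schemes, then accept only URLs under /mnt/ntserver/.
RULES = [
    ("-", r"^(http|ftp|mailto):"),
    ("+", r"^file:/+mnt/ntserver/"),
]


def accepts(url, rules=RULES):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for url in ("file:////mnt/ntserver/test.txt", "file:/mnt/", "file:/", "http://example.com/"):
        print(url, accepts(url))
```

With an anchored accept rule like this, parent directories such as /mnt/ or / never match, so they would be filtered out even with file.crawl.parent left at its default.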