Nutch 1.4 and Solr 3.6 - Nutch not crawling 301/302 redirects

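I'm having a problem where the initial page gets crawled, but the pages it redirects to are neither crawled nor indexed.

I have the http.redirect.max property set to 5, and I have also tried the values 0, 1, and 3.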

<property>
  <name>http.redirect.max</name>
  <value>5</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
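
To make the two modes in that description concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour the description states, not Nutch's actual fetcher code: the function name `resolve`, the `redirects` mapping, and the choice to give up once the hop budget is exhausted are all assumptions made for illustration.

```python
def resolve(url, redirects, max_redirects):
    """Toy model of the http.redirect.max description (not Nutch code).

    `redirects` is a hypothetical dict mapping a URL to its redirect target.
    Returns (resolved_url, deferred_urls).
    """
    if max_redirects <= 0:
        # Zero or negative: do not follow now, only record the target
        # so that a later fetch round can pick it up.
        target = redirects.get(url)
        if target is None:
            return url, []
        return None, [target]

    current, hops = url, 0
    while current in redirects:
        if hops == max_redirects:
            # Assumption: once the hop budget is used up, give up.
            return None, []
        current = redirects[current]
        hops += 1
    return current, []


# Hypothetical data mirroring the trailing-slash redirect in this question.
redirects = {"http://example.com/build": "http://example.com/build/"}
print(resolve("http://example.com/build", redirects, 5))
print(resolve("http://example.com/build", redirects, 0))
```

In this model a positive value resolves the redirect within the same fetch, while 0 leaves the target to a later round, so a crawl with too few rounds could stop before the target is ever fetched.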

Also, Nutch seems to only crawl and push pages that have querystring parameters.

Looking at the output:

http://example.com/build    Version: 7
Status: 4 (db_redir_temp)
Fetch time: Fri Sep 12 00:32:33 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2700 seconds (0 days)
Score: 0.04620983
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: http://example.com/build/
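
When there are many records like this one, it can help to pull out the status and the redirect target programmatically. The following is a minimal sketch that parses the text of a single record in the layout shown above; the function name `parse_record` and the assumption that the redirect target follows `lastModified=<n>: ` in the `_pst_` metadata are inferred from this one sample, not from any documented format.

```python
import re

# The record from the output above, copied as sample input.
SAMPLE = """http://example.com/build    Version: 7
Status: 4 (db_redir_temp)
Fetch time: Fri Sep 12 00:32:33 EDT 2014
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2700 seconds (0 days)
Score: 0.04620983
Signature: null
Metadata: _pst_: temp_moved(13), lastModified=0: http://example.com/build/
"""


def parse_record(text):
    """Extract the URL, status name, and redirect target from one record."""
    lines = text.strip().splitlines()
    url = lines[0].split()[0]  # first token on the first line is the URL

    fields = {}
    for line in lines[1:]:
        key, _, value = line.partition(": ")
        fields[key] = value

    status = re.match(r"\d+ \((\w+)\)", fields.get("Status", ""))
    # Assumption: the redirect target is the token after "lastModified=<n>: ".
    target = re.search(r"lastModified=\d+: (\S+)", fields.get("Metadata", ""))
    return {
        "url": url,
        "status": status.group(1) if status else None,
        "redirect_target": target.group(1) if target else None,
    }


record = parse_record(SAMPLE)
print(record["status"], record["url"], "->", record["redirect_target"])
```

For this record the target differs from the original URL only by the trailing slash.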

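There is a default IIS redirect that throws a 302 to add a trailing slash. I have made sure the slash is already present on all pages, so I'm not sure why the redirect is happening.

For a bit more background, here are some of the parameters I have tried: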

depth=5 (tried 1-10)
threads=30 (tried 1 - 30)
adddays=7 (tried 0, 7)
topN=500 (tried 500, 1000)

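Try running a packet capture on the web server to see what is being served, and on the machine running Nutch to see what is being requested. If both are on the same server, that works fine too. Give it a try, and once the capture is done, enter HTTP in the filter box.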
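
As a lighter-weight complement to a packet capture, the server's response can also be inspected directly without following redirects. The sketch below uses Python's standard `http.client`, which does not follow redirects on its own; the helper names `probe` and `adds_only_trailing_slash` are made up for illustration, and the URL in the usage comment is the placeholder from the question.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def probe(url, timeout=10):
    """Send one GET request and return (status, Location header).

    http.client never follows redirects, so a 301/302 is reported as-is.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def adds_only_trailing_slash(url, location):
    """True if the redirect target is the same URL plus a trailing slash."""
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(url, location) == url + "/"


# Usage (performs a real request, so it is left commented out):
# status, location = probe("http://example.com/build")
# if status in (301, 302) and location:
#     print(status, location, adds_only_trailing_slash("http://example.com/build", location))
```

If the check returns True for URLs in the seed list or in page links, those URLs are still being requested without the trailing slash somewhere.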
