Nutch 2.3.1 on Hadoop crawls only the seed URLs
I have to crawl all the (internal) links of a few URLs. For this I am using Apache Nutch 2.3.1 with Hadoop and HBase. Following is the nutch-site.xml file used for this purpose:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>crawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|protocol-http|indexer-solr|urlfilter-regex|parse-(html|tika)|index-(basic|more|urdu)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
</property>
<property>
<name>http.robots.403.allow</name>
<value>true</value>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
</property>
<property>
<name>http.robots.agents</name>
<value>crawler,*</value>
</property>
<!-- language-identifier plugin properties -->
<property>
<name>lang.ngram.min.length</name>
<value>1</value>
</property>
<property>
<name>lang.ngram.max.length</name>
<value>4</value>
</property>
<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
</property>
<property>
<name>lang.extraction.policy</name>
<value>detect,identify</value>
</property>
<property>
<name>lang.identification.only.certain</name>
<value>true</value>
</property>
<!-- Language properties ends here -->
<property>
<name>http.timeout</name>
<value>20000</value>
</property>
<!-- These tags are included because the number of crawled documents has started to decrease -->
<property>
<name>fetcher.max.crawl.delay</name>
<value>10</value>
</property>
<property>
<name>generate.max.count</name>
<value>10000</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
</configuration>
A similar question has been asked here, but it was for version 1.1, and the solution given there did not work for me.

Check your conf/regex-urlfilter.txt: does one of its URL-filter regexes block the outlinks you expect? The file should end with the catch-all accept rule:
# accept anything else
+.
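For reference, the rules preceding that catch-all matter: regex-urlfilter.txt is evaluated top to bottom, and the first matching rule wins, so a `-` rule higher up can silently discard your outlinks before they ever reach the fetch list. A minimal sketch of the stock file (the exact rules shipped with your Nutch install may differ):

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries
-[?*!@=]

# accept anything else
+.
```

If your target pages contain `?` or `=` in their URLs, for example, the second rule above would filter them out; relax or remove it before the `+.` line.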
When you set db.ignore.external.links to true, Nutch does not generate outlinks that point to a different host. You also need to check in conf/nutch-default.xml that the db.ignore.internal.links property is false; otherwise there will be no outlinks left to generate:
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
HTH.

Did you find a solution to this problem?

After injecting the seed URLs you need to follow this cycle: generate > fetch > parse > updatedb. Since all links cannot be fetched in a single crawl round, you have to repeat this cycle several times.
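The cycle above can be sketched with the Nutch 2.x command-line script; the `-topN` value and the number of rounds are illustrative, not prescriptive:

```
# Seed the crawl once, then repeat generate > fetch > parse > updatedb.
# Each round discovers and fetches one more "hop" of outlinks.
bin/nutch inject urls/

for round in 1 2 3; do            # number of rounds is illustrative
  bin/nutch generate -topN 1000   # select up to 1000 URLs for this batch
  bin/nutch fetch -all            # fetch the generated batch
  bin/nutch parse -all            # parse fetched content, extract outlinks
  bin/nutch updatedb -all         # merge new outlinks back into the db
done
```

Each iteration adds the outlinks discovered in the previous one, so with only a single round you will never get past the seed URLs.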