java.lang.NullPointerException (Nutch 2.2.1 with MySQL as the data store)

java mysql


I am new to this and started from a tutorial. My first crawl, of the URL nutch.apache.org, succeeded, but when I tried another URL, this exception appeared in my hadoop.log:

java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)

I would appreciate any suggestions for solving this problem.

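I have never used Nutch, but this looks like a common error. An NPE thrown from <init> means that constructing the Utf8 instance failed, which typically happens when the constructor is handed a null string.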
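To make the point about <init> concrete, here is a minimal sketch of how a missing configuration value can surface as an NPE inside a constructor. The `Text` class is a hypothetical stand-in, not Avro's real Utf8, and the key `example.batch.id` is invented for illustration; which setting GeneratorReducer actually reads is not stated in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class NullInConstructorSketch {

    // Hypothetical stand-in for a string wrapper; it only mimics the
    // "dereference the argument in the constructor" behaviour.
    static final class Text {
        final byte[] bytes;

        Text(String s) {
            // Throws NullPointerException when s is null, so the top
            // stack frame is reported as Text.<init>.
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Returns the method name of the top stack frame if an NPE occurs,
    // or "ok" when the wrapper is built successfully.
    static String tryWrap(Map<String, String> conf, String key) {
        try {
            new Text(conf.get(key)); // Map.get returns null for a missing key
            return "ok";
        } catch (NullPointerException e) {
            return e.getStackTrace()[0].getMethodName();
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Key not set: the constructor receives null.
        System.out.println(tryWrap(conf, "example.batch.id"));
        // Key set: construction succeeds.
        conf.put("example.batch.id", "batch-1");
        System.out.println(tryWrap(conf, "example.batch.id"));
    }
}
```

Read this way, the trace suggests looking for a value that was never set before the generate step ran, rather than for a bug in the wrapper class itself.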

The cause is that the "crawl" command is deprecated in Nutch 2 and has been replaced by the script located at "bin/crawl".

Just copy the file $NUTCH_HOME/src/bin/crawl into the deploy directory, $NUTCH_HOME/runtime/deploy/bin, and then run the crawl command from there.

Hope this helps.

Hi. I'll give it a try. Thanks for the suggestion :)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>http.agent.name</name>
<value>Maria</value>
</property>

<property>
<name>http.robots.agents</name>
<value>Maria</value>
<description>....
</description>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting a non-English language as the default one to retrieve.
It is a useful setting for search engines built for certain national groups.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

</configuration>
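Because a stray or unclosed tag in a configuration file like the one above is easy to miss, a quick well-formedness check can help before starting a crawl. The following is only a sketch using the JDK's built-in XML parser; the class name `SiteConfigSketch` and the helper `readProperties` are hypothetical and are not part of Nutch or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class SiteConfigSketch {

    // Parses Hadoop-style <property><name/><value/></property> entries.
    // Throws IllegalArgumentException if the XML is not well-formed.
    static Map<String, String> readProperties(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Configuration is not well-formed XML", e);
        }
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    // Returns the trimmed text of the first child element with this tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration><property><name>http.agent.name</name>"
                + "<value>Maria</value></property></configuration>";
        System.out.println(readProperties(xml));
    }
}
```

Feeding it a file with a closing tag that has no opening counterpart raises the exception instead of returning a partial map, which makes that kind of mistake visible early.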

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

+^http://([a-z0-9]*\.)*nutch.apache.org/

#
-.
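The comments in the filter file describe a first-match rule: each line is a regular expression prefixed by '+' or '-', the first pattern that matches decides, and a URL that matches nothing is ignored. The sketch below illustrates that rule in plain Java. It is not Nutch's own filter plugin; the class name and the `SAMPLE_RULES` constant are hypothetical, and it assumes unanchored matching (a pattern may match anywhere in the URL).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class FirstMatchUrlFilterSketch {

    // A subset of the rules shown above, in the same order.
    static final String SAMPLE_RULES = String.join("\n",
            "# skip file: ftp: and mailto: urls",
            "-^(file|ftp|mailto):",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*nutch.apache.org/",
            "-.");

    // One parsed rule: '+' means accept, '-' means reject.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Skips blank lines and '#' comments; every other line must start with '+' or '-'.
    static List<Rule> parse(String rulesText) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : rulesText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // The first matching rule decides; no match means the URL is ignored.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse(SAMPLE_RULES);
        System.out.println(accepts(rules, "http://nutch.apache.org/"));
        System.out.println(accepts(rules, "http://example.com/"));
    }
}
```

Under these assumptions, a URL outside nutch.apache.org falls through to the final `-.` line and is rejected, so a filter shaped like this one only admits seeds from that host.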