Nutch 2 with Cassandra as storage is not crawling data correctly

I am using Nutch 2.x with Cassandra as storage. Currently I am crawling just one website, and the data is being loaded into Cassandra in bytecode format. When I use the readdb command in Nutch, I do not get any useful crawl data.

Below are the details of the different files and the output I am getting:

Command used to run the crawler ===================================

bin/crawl urls/ crawlDir/ http://localhost:8983/solr/ 3

seed.txt:
http://www.ft.com

regex-urlfilter.txt:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

Log excerpt:
2015-02-18 13:57:51,253 ERROR store.CassandraStore - 
2015-02-18 13:57:51,253 ERROR store.CassandraStore - [Ljava.lang.StackTraceElement;@653e3e90
2015-02-18 14:01:45,537 INFO  connection.CassandraHostRetryService - Downed Host Retry service started with queue size -1 and retry delay 10s
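Since the comments below revolve around whether these filter rules are skipping links, the rule set can be sanity-checked outside Nutch with a short sketch. This is not part of the original question; the rules are a simplified subset of the file above and the sample URLs are illustrative (Nutch applies the first matching rule, `-` skips, `+` accepts):

```python
import re

# Simplified subset of the regex-urlfilter.txt rules above:
# ('-' = skip, '+' = accept); first matching pattern wins.
rules = [
    ("-", re.compile(r"^(file|ftp|mailto):")),
    ("-", re.compile(r"\.(gif|jpg|jpeg|png|ico|css|zip|exe|js)$", re.IGNORECASE)),
    ("-", re.compile(r"[?*!@=]")),
    ("+", re.compile(r".")),
]

def accepted(url):
    """Return True if the first rule matching the URL accepts it."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False

print(accepted("http://www.ft.com/world"))       # plain article URL
print(accepted("http://www.ft.com/logo.png"))    # image suffix, skipped
print(accepted("http://www.ft.com/search?q=x"))  # query character, skipped
```

Running a few outlinks from the target site through a check like this helps separate "the filter dropped them" from "they were never fetched".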
==== Output of the readdb command, reading data from the cassandra webpage.f table ====

~/Documents/Softwares/apache-nutch-2.3/runtime/local$ bin/nutch readdb -dump data -content
~/Documents/Softwares/apache-nutch-2.3/runtime/local/data$ cat part-r-00000 
http://www.ft.com/  key:    com.ft.www:http/
baseUrl:    null    
status: 4 (status_redir_temp)    
fetchTime:  1426888912463
prevFetchTime:  1424296904936
fetchInterval:  2592000
retriesSinceFetch:  0    
modifiedTime:   0    
prevModifiedTime:   0
protocolStatus: (null)    
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :   y
marker dist :   0    
reprUrl:    null    
batchId:    1424296906-20007    
metadata _csh_ : 
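To see what actually landed in Cassandra, the column family can also be inspected directly. The session below is illustrative only (not output from the question); the keyspace and column-family names assume the default gora-cassandra-mapping.xml shipped with Nutch 2.x (keyspace webpage, family f), and port 9160 is the Thrift port that the Hector connection in the logs above uses:

    $ cassandra-cli -h localhost -p 9160
    [default@unknown] use webpage;
    [default@webpage] list f limit 5;

If only the seed row appears here, the fetch/parse cycle never produced outlink records, which matches the single status_redir_temp entry in the dump above.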
Please let me know if you need more information. Can anyone help me with this?

Thanks in advance.
-Sumant

I only started working with Nutch and Cassandra today. I am not getting the same errors in my log file during the crawl.

Have you double-checked your nutch-site.xml and gora.properties settings? This is how my configuration files currently look.

nutch-site.xml

    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>My Spider</value>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.cassandra.store.CassandraStore</value>
        <description>Default class for storing data</description>
      </property>
    </configuration>
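For completeness, a gora.properties fragment for the Cassandra backend might look like the sketch below. The property names are the ones documented for the gora-cassandra module bundled with Nutch 2.x, but the server address is an assumption for a single local node (9160 is the Thrift port Hector connects to); adjust it to your cluster:

    # gora.properties (illustrative; single local Cassandra node assumed)
    gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
    gora.cassandrastore.servers=localhost:9160

The storage.data.store.class value in nutch-site.xml and the default datastore here should name the same class, otherwise Nutch and Gora disagree about where records go.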

You may want to edit your post! Your code is a bit hard to read and I think it needs formatting. Do you want me to delete some of it? I think you should. That might help you get answers.

Done editing. Hope it is in a readable format now. Thanks for the reply. I added the same settings in both files.

Could you first post the contents of your regex-urlfilter.txt file, then your seed.txt file, and what is loaded into your cassandra webpage.f table? I think there is a problem in my regex-urlfilter.txt that is skipping links. Are you able to crawl all the links?

My regex-urlfilter.txt is exactly the same as yours. I tested one or two links, one of which was [. I also did not get past the first level, so I have the same problem as you, except that I do not get "ERROR store.CassandraStore" reported in my logs. So I am also looking at my regex settings. I will add my webpage.f table. Sorry, I thought I could at least get you past the CassandraStore error.

Thanks. Now I am not getting the Cassandra error either :) No idea how it got fixed :P The problem is that Nutch is now not crawling any links other than the ones in seed.txt. Please let me know if you manage to solve this, and let me know if you need any help.

Chris, any progress?

Sumant, got back to it today, no luck yet. Posting to the Nutch user group.