How do I read Nutch content from Java/Scala? (Java, Hadoop, Nutch)


I am using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to analyze the sites' HTML with Jsoup from a Java (Scala) program.

I got Nutch working by running the individual commands rather than the script, and I believe it saves the sites' HTML in the crawl/segments//content/part-00000 directory.
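As a rough illustration of that layout, the sketch below lists the data file of every part-* directory under a segment's content directory. It assumes the segment lives on the local filesystem and that each part directory is a Hadoop MapFile (a directory holding a data file and an index file). The class name SegmentPartLister and the method listContentDataFiles are hypothetical names chosen for this example, not part of Nutch or Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of Nutch or Hadoop.
public class SegmentPartLister {

    /**
     * Returns the "data" file of every part-* directory under
     * <segmentDir>/content, sorted by path. Assumes a local filesystem.
     */
    public static List<Path> listContentDataFiles(Path segmentDir) throws IOException {
        Path contentDir = segmentDir.resolve("content");
        List<Path> dataFiles = new ArrayList<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(contentDir, "part-*")) {
            for (Path part : parts) {
                Path data = part.resolve("data");
                // Skip part directories that do not contain a data file.
                if (Files.isRegularFile(data)) {
                    dataFiles.add(data);
                }
            }
        }
        Collections.sort(dataFiles);
        return dataFiles;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to one segment directory.
        for (Path p : listContentDataFiles(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Each path printed this way is a candidate for the data file that a sequence-file reader would open; a crawl with several reducers may produce more than one part directory.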

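The problem is that I don't know how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this article, but found it a bit overwhelming because I have never used Hadoop.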

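I tried to adapt the example code to my environment, and this is what I came up with (mostly by guessing):

However, I get the following exception at runtime: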

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
    at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)

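I don't know how to use MapFile.Reader, in particular which constructor arguments I should pass to it. Which configuration objects should I pass in? Is this the right file system? And is this the data file I'm interested in?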

Scala:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{SequenceFile, Text}
import org.apache.nutch.protocol.Content
import org.apache.nutch.util.NutchConfiguration

val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)

// Allocate a fresh key/value pair per record and stop once next() returns false.
val webdata = Stream.continually {
  val key = new Text()
  val content = new Content()
  if (reader.next(key, content)) Some((key, content)) else None
}.takeWhile(_.isDefined).map(_.get)

// Print the first (URL, content) pair, if the file contains any record.
webdata.headOption.foreach(println)

Java:

import java.io.IOException;
import org.apache.commons.cli.Options;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = NutchConfiguration.create();
        Options opts = new Options();
        // Strip generic Hadoop options; the first remaining argument is the segment path.
        GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
        String[] remainingArgs = parser.getRemainingArgs();
        FileSystem fs = FileSystem.get(conf);
        String segment = remainingArgs[0];
        Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
        // try-with-resources closes the reader when the loop finishes.
        try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf)) {
            Text key = new Text();        // the URL of the fetched page
            Content content = new Content();
            // Loop through the records of the sequence file
            while (reader.next(key, content)) {
                byte[] raw = content.getContent();
                System.out.write(raw, 0, raw.length);
            }
            System.out.flush();
        }
    }
}

Alternatively, you can use org.apache.nutch.segment.SegmentReader.
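Since the question is about feeding the HTML to Jsoup, note that the Java code above gets each page body as a raw byte array from content.getContent(), so it has to be decoded into a String first. The sketch below is one hedged way to do that, using only the JDK: it decodes with a caller-supplied charset name and falls back to UTF-8 when the name is missing or unknown. HtmlDecoder and decode are hypothetical names for this example, and the UTF-8 fallback is an assumption; crawled pages may use other encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper: not part of Nutch, Hadoop, or Jsoup.
public class HtmlDecoder {

    /** Decodes raw page bytes with the named charset, falling back to UTF-8. */
    public static String decode(byte[] raw, String charsetName) {
        Charset charset = StandardCharsets.UTF_8; // assumed default
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Malformed or unknown charset name: keep the UTF-8 fallback.
            }
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

Inside the read loop, the decoded string could then be handed to Jsoup, for example Jsoup.parse(HtmlDecoder.decode(content.getContent(), "UTF-8"), key.toString()), where the second argument is the page URL used as the base URI.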

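Hey @AmitChotaliya, as far as I can tell you are not actually using org.apache.nutch.segment.SegmentReader? (Did you mean to put "Alternatively" in front of it, as an alternative to what you posted? I am assuming so for now, so I left the ":" out of my edit and put the "." back after it.)

Oh, sorry, it's either/or. I'll update the answer. Did it work?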