How to create a Nutch plugin that returns the raw HTML to the parser


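I am trying to create a plugin for Nutch. I am using Nutch 1.7 with Solr, and I have worked through several different tutorials. I want to implement a plugin that returns the raw HTML data. I used the standard Nutch wiki and the following tutorial:

I created two files, getDivinfohtml.java and getDivinfo.java.

getDivinfohtml.java needs to read the content and then return the complete source code, or at least part of it.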

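 package org.apache.nutch.indexer;

 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.HTMLMetaTags;
 import org.apache.nutch.parse.HtmlParseFilter;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.protocol.Content;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.w3c.dom.DocumentFragment;

 public class getDivInfohtml implements HtmlParseFilter
 {
     private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
     private Configuration conf;
     public static final String TAG_KEY = "source";
     // Logger logger = Logger.getLogger("mylog");
     // FileHandler fh;
     // FileSystem fs = FileSystem.get(conf);
     // Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
     // SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
     // Text key = new Text();
     // Content content = new Content();
     // fh = new FileHandler("/root/JulienKulkerNutch/mylogfile.log");
     // logger.addHandler(fh);
     // SimpleFormatter formatter = new SimpleFormatter();
     // fh.setFormatter(formatter);

     public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
     {
         try
         {
             LOG.info("Parsing Url:" + content.getUrl());
             // Guard against pages without the marker: substring(-1) would throw.
             String raw = content.toString();
             int start = raw.indexOf("<!DOCTYPE html");
             LOG.info("Julien: " + (start >= 0 ? raw.substring(start) : raw));

             Parse parse = parseResult.get(content.getUrl());
             Metadata metadata = parse.getData().getParseMeta();
             // "fullcontent" is null unless an earlier step stored it in the parse metadata.
             String fullContent = metadata.get("fullcontent");
             if (fullContent == null)
             {
                 return parseResult;
             }

             Document document = Jsoup.parse(fullContent);
             Element contentwrapper = document.select("div#jobBodyContent").first();
             if (contentwrapper != null)
             {
                 // Store under TAG_KEY so getDivInfo can read the value back with the same key.
                 metadata.add(TAG_KEY, contentwrapper.text());
             }
         }
         catch (Exception e)
         {
             LOG.error("Failed to extract source for " + content.getUrl(), e);
         }

         return parseResult;
     }

     public Configuration getConf()
     {
         return conf;
     }

     public void setConf(Configuration conf)
     {
         this.conf = conf;
     }
 }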
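Since the goal is the raw HTML rather than the parsed text, one option is to decode the fetched bytes directly instead of relying on a metadata entry. The following is only a minimal sketch in plain Java: `RawHtml`, `charsetOf`, `decode` and `fromDoctype` are hypothetical names, not part of Nutch. It assumes the bytes would come from `content.getContent()` inside `filter`, and that a Content-Type header value is available from the caller (for example from the content metadata); both are assumptions about how it would be wired in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper (not a Nutch class): turns fetched bytes into an HTML string. */
final class RawHtml {
    private RawHtml() {}

    /** Reads the charset from a value like "text/html; charset=ISO-8859-1"; falls back to UTF-8. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the raw bytes using the charset named in the Content-Type value. */
    static String decode(byte[] raw, String contentType) {
        return new String(raw, charsetOf(contentType));
    }

    /** Returns the markup from "<!DOCTYPE html" onward, or the whole string if the marker is absent. */
    static String fromDoctype(String html) {
        int start = html.indexOf("<!DOCTYPE html");
        return start >= 0 ? html.substring(start) : html;
    }
}
```

Inside the filter this would be used roughly as `RawHtml.fromDoctype(RawHtml.decode(content.getContent(), contentTypeHeader))`, with the result stored under `TAG_KEY`. Working from the bytes avoids `content.toString()`, which mixes other fields into the string along with the page body.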

The error is in getDivinfo, at: String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);

[javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol

[javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);

You may need to implement HTMLParser. In the getFields implementation:

  // Fields this plugin needs: the page content and its outlinks.
  private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
  }

  public Collection<WebPage.Field> getFields() {
    return FIELDS;
  }
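To illustrate what a `getFields` method like the one above is for: as I understand it, the plugin declares which page fields it needs, and the caller gathers those declarations before loading a page. The sketch below only models that idea in plain Java. The `Field` enum, the `FieldPlugin` interface, `RawHtmlFilter` and `fieldsToLoad` are hypothetical stand-ins, not the real Nutch types, so treat it as an illustration rather than the actual framework behaviour.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class FieldsSketch {
    private FieldsSketch() {}

    /** Hypothetical stand-in for WebPage.Field; only a few values are modelled. */
    enum Field { CONTENT, OUTLINKS, TITLE }

    /** Hypothetical plugin contract: each plugin declares the fields it reads. */
    interface FieldPlugin {
        Collection<Field> getFields();
    }

    /** Mirrors the snippet above: this plugin needs CONTENT and OUTLINKS. */
    static final class RawHtmlFilter implements FieldPlugin {
        private static final Set<Field> FIELDS =
            Collections.unmodifiableSet(EnumSet.of(Field.CONTENT, Field.OUTLINKS));

        @Override
        public Collection<Field> getFields() {
            return FIELDS;
        }
    }

    /** Union of the fields requested by all plugins, i.e. what a caller would have to load. */
    static Set<Field> fieldsToLoad(List<? extends FieldPlugin> plugins) {
        Set<Field> all = EnumSet.noneOf(Field.class);
        for (FieldPlugin plugin : plugins) {
            all.addAll(plugin.getFields());
        }
        return all;
    }
}
```

If `CONTENT` is missing from the declared set, a plugin in such a scheme would never see the page body, which is why the answer lists it explicitly.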
