How to write a custom ContentHandler with Apache Tika? (Java, HTML parsing, Apache Tika)


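I want to use Apache Tika to extract the text inside certain tags of an HTML file (for example <dd>, which the code below targets). To do that, I am writing a custom ContentHandler that should pull the information out of those tags.

My custom ContentHandler is shown below. It is not finished yet, and it already does not work as expected: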

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// DefaultHandler implements ContentHandler with no-op methods, so only the
// callbacks of interest need to be overridden for the class to compile.
public class TableContentHandler extends DefaultHandler {

    // key = abbreviation
    // value = information / description for abbreviation
    private final Map<String, String> abbreviations = new HashMap<>();

    // current abbreviation
    private String abbreviation = null;

    // <dd> element contains abbreviation. So this boolean variable will be set when
    // <dd> element is found
    private boolean ddElementStarted = false;

    // this method is not giving contents within <dd> and </dd> tags
    @Override
    public void characters(char[] chars, int start, int length) throws SAXException {
        if (ddElementStarted) {
            System.out.println("chars found...");
        }
    }

    // set boolean ddElementStarted to true to indicate that content handler found
    // <dd> element
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (localName.equalsIgnoreCase("dd")) {
            ddElementStarted = true;
        }
    }
}


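My assumption is that whenever the content handler enters startElement() and the element name is dd, I set ddElementStarted = true. Then, to get the content between <dd> and </dd>, I check that flag in characters().

In characters() I test whether ddElementStarted is true and expect the chars array to hold the content between <dd> and </dd>, but it does not work :(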
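For reference, the flag-based approach described above can be sketched with plain SAX. This is a minimal sketch, not a verified Tika setup: it uses the JDK's own SAX parser on a well-formed XHTML string, and the names DdTextDemo, DdTextHandler and extract are made up for illustration. Compared with the handler in the question, it reads only the slice of the chars array given by the start and length arguments, and it resets the flag in endElement(). Because it is an ordinary ContentHandler, the same handler could in principle be passed to a Tika parser, but whether Tika's HTML parser forwards <dd> events unchanged is an assumption that is not checked here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DdTextDemo {

    /** Hypothetical handler that collects the text of every <dd> element. */
    static class DdTextHandler extends DefaultHandler {
        private final List<String> values = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();
        private boolean insideDd = false;

        // localName may be empty when the parser is not namespace-aware,
        // so fall back to the qualified name in that case.
        private static String nameOf(String localName, String qName) {
            return (localName == null || localName.isEmpty()) ? qName : localName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (nameOf(localName, qName).equalsIgnoreCase("dd")) {
                insideDd = true;
                buffer.setLength(0); // start collecting a fresh value
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only ch[start .. start + length) belongs to this event; the
            // parser may also deliver one text node in several chunks.
            if (insideDd) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (nameOf(localName, qName).equalsIgnoreCase("dd")) {
                values.add(buffer.toString().trim());
                insideDd = false; // stop collecting until the next <dd>
            }
        }

        List<String> getValues() {
            return values;
        }
    }

    /** Parses a well-formed XHTML string and returns the text of each <dd>. */
    public static List<String> extract(String xhtml) throws Exception {
        DdTextHandler handler = new DdTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getValues();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<dl><dt>API</dt><dd>Application Programming Interface</dd>"
                + "<dt>SAX</dt><dd>Simple API for XML</dd></dl>";
        System.out.println(extract(xhtml));
    }
}
```

The text is buffered and only stored in endElement() because SAX does not promise that the whole content of an element arrives in a single characters() call.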

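I would like to know:

Am I heading in the right direction?

Is this the right way to parse HTML with Tika, or is there another way?

Should I choose a different HTML parsing API such as JSoup? I only need the content of a few tags; the rest of the HTML page does not interest me.

Is there a way to specify an XPath expression in Apache Tika? I could not find this in the book Tika in Action.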

A simple solution is JSoup. With it we can easily get the value inside any tag, so instead of writing a new ContentHandler, just parse the HTML with JSoup.
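A minimal sketch of what that could look like, assuming the jsoup library is on the classpath and that the tags of interest are <dd> elements as in the question. The class name JsoupDdDemo and the method extractDd are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupDdDemo {

    /** Returns the text of every <dd> element in the given HTML string. */
    public static List<String> extractDd(String html) {
        Document doc = Jsoup.parse(html);       // tolerant HTML parsing
        List<String> values = new ArrayList<>();
        for (Element dd : doc.select("dd")) {   // CSS selector for <dd> tags
            values.add(dd.text());              // text content without markup
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<html><body><dl>"
                + "<dt>API</dt><dd>Application Programming Interface</dd>"
                + "</dl></body></html>";
        System.out.println(extractDd(html));
    }
}
```

Changing the selector string is enough to target other tags, which is why this route needs less code than a hand-written handler when only a few elements matter.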