Java SAX parser: extracting the string inside a tag

Here is my problem: I need to use a SAX parser to extract the text between the <p> tags, without the XML markup.

    <title>1. Introduction</title>
    <p>The Lorem ipsum 
           <xref ref-type="bibr" rid="B1">
                1
           </xref>. 
           Lorem ipsum 23.
     </p>
     <p>The L domain recruits an ATP-requiring cellular factor for this 
           scission event, the only known energy-dependent step in assembly 
           <xref ref-type="bibr" rid="B2">
                2
           </xref>. 
           Domain is used here to denote the amino 
           acid sequence that constitutes the biological function.
     </p>

This is what I would like to get: a list, listP, containing the two paragraphs:

1) Lorem ipsum 1 Lorem ipsum 23.
2) The L domain recruits an ATP-requiring cellular factor for this 
       scission event, the only known energy-dependent step in assembly 2 
       Domain is used here to denote the amino 
       acid sequence that constitutes the biological function.

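I'm not sure what you mean by "is it possible using endElement", but it is certainly possible. You need to write your SAX application so that it:

(1) ignores every startElement event that occurs between the startElement and endElement events of a <p> paragraph. This takes only simple state tracking; alternatively, decide that you are not interested in any element other than paragraphs and have your element event handlers do nothing for anything you don't care about.

(2) accumulates the text delivered by separate characters() events until the end of the paragraph. You need to do this anyway, because SAX always reserves the right to deliver contiguous text as several characters() calls, which comes down to how the parser manages its buffers.
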
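The two points above can be sketched as a single handler. This is only a minimal illustration, not the asker's code: the class name ParagraphHandler, the extract helper, and the whitespace normalization are assumptions added here, and the element name "p" is hard-coded.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every <p> element.
class ParagraphHandler extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inParagraph = false; // simple state tracking

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("p")) {
            inParagraph = true;
            buffer.setLength(0); // start a fresh paragraph
        }
        // Any other element (for example <xref>) is ignored on purpose.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so always append.
        if (inParagraph) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("p")) {
            // Collapse runs of whitespace so line breaks in the XML disappear.
            paragraphs.add(buffer.toString().replaceAll("\\s+", " ").trim());
            inParagraph = false;
        }
    }

    // Parses an XML string and returns the text of each <p>, in document order.
    static List<String> extract(String xml) {
        try {
            ParagraphHandler handler = new ParagraphHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.paragraphs;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sec><title>1. Introduction</title>"
                + "<p>The Lorem ipsum <xref ref-type=\"bibr\" rid=\"B1\"> 1 </xref>."
                + " Lorem ipsum 23.</p></sec>";
        System.out.println(extract(xml));
    }
}
```

Because the flag is only set inside <p>, the <title> text never reaches the buffer, and the text inside <xref> is kept because it still arrives through characters() while the flag is set.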
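There are many possible solutions. With a SAX parser you usually just add a few boolean flags that record which state the parse is in. In this simple example, it is enough to change this line: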
tmpValue = new String(ac, i, j);

to this:
if (tmpValue.equals(""))
    tmpValue = new String(ac, i, j);
else
    tmpValue += new String(ac, i, j);

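or to the equivalent check against null, depending on how the tmpValue variable is initialized (if it is not initialized yet, it should be).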
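To see why appending matters, here is a small experiment, offered as a sketch rather than as part of the original answer. The class ChunkDemo and its collect method are hypothetical names; the method parses the same input twice, once overwriting the value on every characters() call and once appending to it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: compares overwriting with appending in characters().
class ChunkDemo {
    // Returns the text seen by the handler; when append is false,
    // every characters() call replaces what was collected before.
    static String collect(String xml, boolean append) {
        StringBuilder result = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                if (!append) {
                    result.setLength(0); // same effect as tmpValue = new String(...)
                }
                result.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String xml = "<p>A<x>1</x>B</p>";
        System.out.println("overwrite: " + collect(xml, false));
        System.out.println("append:    " + collect(xml, true));
    }
}
```

The child element splits the paragraph text into separate characters() calls, so the overwriting variant keeps only the text that follows the closing child tag, while the appending variant keeps all of it.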
To collect the content of every paragraph, you need:
public void endElement(String s, String s1, String element) throws SAXException {

    if (element.equals(Finals.PARAGRAPH)) {
        Paragraph paragraph = new Paragraph();
        paragraph.setContext(tmpValue);
        System.out.println("Contesto: " + tmpValue);
        listP.add(paragraph);
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}

To leave out the title:
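public void startElement(
    String uri,
    String localName,
    String qName,
    Attributes atts) {

    // Compare qName: localName may be empty unless the parser is namespace-aware.
    if (qName.equals(Finals.PARAGRAPH)) {
        // Discard text collected outside <p>, such as the <title> content.
        tmpValue = ""; // or tmpValue = null; for the second version
    }
}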

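Use a stack.

Push on the startElement event and pop on the endElement event.

Or, if that does not work for you, just push the <p> data of all events onto the stack and then, after the end of the document (endDocument), pop it element by element. The data comes off in reverse order, so store it reversed, from <p> to </p>.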
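One possible reading of the stack suggestion is sketched below. It is an interpretation, not the answerer's code: the class StackParagraphHandler and its extract helper are hypothetical, and the stack holds the names of the currently open elements so that text is collected whenever a <p> is somewhere on it.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks open elements on a stack.
class StackParagraphHandler extends DefaultHandler {
    private final Deque<String> openElements = new ArrayDeque<>();
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("p") && !openElements.contains("p")) {
            buffer.setLength(0); // entering an outermost <p>
        }
        openElements.push(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collect text only while some <p> is still open.
        if (openElements.contains("p")) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openElements.pop();
        if (qName.equals("p") && !openElements.contains("p")) {
            // The outermost <p> just closed: store its normalized text.
            paragraphs.add(buffer.toString().replaceAll("\\s+", " ").trim());
        }
    }

    // Parses an XML string and returns the text of each outermost <p>.
    static List<String> extract(String xml) {
        try {
            StackParagraphHandler handler = new StackParagraphHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.paragraphs;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<sec><title>T</title><p>A <xref>2</xref>. B</p></sec>"));
    }
}
```

Compared with a single boolean flag, the stack costs a little more bookkeeping but also copes with elements nested to any depth inside the paragraph.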
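
Because obviously endElement is called at... the end of the element. You are interested in the part called CDATA (the character data). You should find the appropriate handler for it.
You should show your current attempt with your actual code.
It looks like you are doing fine. Where is the problem?
I need this result: The L domain recruits an ATP-requiring cellular factor for this scission event, the only known energy-dependent step in assembly 2. Domain is used here to denote the amino acid sequence that constitutes the biological function. But I only get: Domain is used here to denote the amino acid sequence that constitutes the biological function.
I get a NullPointerException when I add that solution.
@user3162945 Please be specific. I provided two solutions. Is tmpValue also initialized the way I suggested?
I forgot to initialize tmpValue. It works now, but I don't get the whole string, only the part after /xref.
@user3162945 I misunderstood your original requirement. Check this edit.
It works, but it picks up everything in the XML. I only need the parts inside the p tags.
Check my edit.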