How to filter out noise in nested tags with JSoup? (Java)

Tags: java, html, xml, parsing, jsoup

How can I filter out the noise inside nested tags? For example, I have the following input:

[in]:

<html>
  <source>
     <noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo
  </source>
</html>

Desired output:

foo bar bar
baring foo

What I actually get:

something something, many many thingsfoo bar barmore something something noisebaring foo

I have tried the following, but I still get the noise from the nested tags:

import java.io.*;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class HelloJsoup {
    public static void main(String[] args) throws IOException {

        String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        Document doc = Jsoup.parse(br, "", Parser.xmlParser());
        //System.out.println(doc);
        for (Element sentence : doc.getElementsByTag("source"))
            System.out.print(sentence.text());

    }
}
Try:

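Removing the noise tags first gives you foo bar barbaring foo. To get the specified output, though, you can iterate over the nodes and print each TextNode on a new line. For example: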

// Besides the imports in the question, this snippet also needs
// org.jsoup.nodes.Node and org.jsoup.select.Elements.
String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());

Element source = doc.select("source").first(); // select the source element

Elements noise = doc.select("noise");          // select the noise elements
for (Element e : noise) {                      // loop through and remove each one from the doc
    e.remove();
}

for (Node node : source.childNodes()) {
    System.out.println(node);                  // print each remaining text node on its own line
}

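Update

I found a simpler way:

// Also needs org.jsoup.nodes.TextNode.
Element source = doc.select("source").first(); // select the source element

for (TextNode node : source.textNodes()) {
    System.out.println(node);
}

It iterates over the text nodes that the source element owns directly and prints each one on a new line. The output is: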

foo bar bar
baring foo
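
For a version that can be compiled and run on its own, here is a minimal sketch of the same idea. It assumes jsoup is on the classpath, as the question's code already does. The class name OwnTextLines and the helper ownTextLines are made up for illustration and are not part of jsoup. The sketch also skips whitespace-only text nodes, which appear when the input is pretty-printed as in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class OwnTextLines {

    // Hypothetical helper: returns the text of each text node that is a direct
    // child of the first element matching tagName, skipping blank nodes.
    static List<String> ownTextLines(String xml, String tagName) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element element = doc.select(tagName).first();
        List<String> lines = new ArrayList<>();
        if (element == null) {
            return lines; // no matching element, so nothing to collect
        }
        for (TextNode node : element.textNodes()) {
            if (!node.isBlank()) {
                lines.add(node.text().trim()); // drop leading/trailing whitespace
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        for (String line : ownTextLines(xml, "source")) {
            System.out.println(line);
        }
    }
}
```

Because textNodes() only returns direct children, the text inside the nested noise elements is never visited, so there is no need to remove those elements first.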

sentence.ownText() returns the correct set of strings, but there are some spacing issues. Is there any way to fix that?

FYI, I updated my answer because I found a simpler way.
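
On the spacing issue raised in the comment, one possible workaround is to join the element's own text nodes yourself instead of relying on ownText(). This is only a sketch, and it assumes a single space between segments is what is wanted. The class name OwnTextJoined and the method ownTextSpaced are hypothetical. Only textNodes(), isBlank() and text() come from jsoup.

```java
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class OwnTextJoined {

    // Hypothetical helper: joins the element's direct text nodes with one space,
    // ignoring whitespace-only nodes and trimming each segment.
    static String ownTextSpaced(Element element) {
        return element.textNodes().stream()
                .filter(node -> !node.isBlank())
                .map(node -> node.text().trim())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String xml = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element source = doc.select("source").first();
        System.out.println(ownTextSpaced(source)); // foo bar bar baring foo
    }
}
```

Swapping the " " separator for System.lineSeparator() would give one segment per line instead.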