How do I filter out the noise in nested tags with JSoup? (Java)
How can I filter out the noise in nested tags? For example, I have the following input:
<html>
<source>
<noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo
</source>
</html>
Desired output:
foo bar bar
baring foo
Actual output:
something something, many many thingsfoo bar barmore something something noisebaring foo
I have tried the following, but I still get the noise from the nested tags:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
public class HelloJsoup {
public static void main(String[] args) throws IOException {
String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());
//System.out.println(doc);
for (Element sentence : doc.getElementsByTag("source"))
System.out.print(sentence.text());
}
}
Try:
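If you remove the noise tags first, the remaining text is "foo bar barbaring foo", with no separator between the two fragments. To get the specified output, you can instead iterate over the child nodes and print each TextNode on a new line. For example: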
String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());
Element source = doc.select("source").first(); // select source element
Elements noise = doc.select("noise"); // Select noise elements
for (Element e : noise) { // loop through and remove each from doc
e.remove();
}
for (Node node : source.childNodes()) {
System.out.println(node); // print each remaining textnode on a new line
}
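As a self-contained sketch of the "remove the noise first" step, the loop above can also be written with Elements.remove(), which detaches every matched element in one call. The class name RemoveNoise and the helper withoutNoise are hypothetical names used only for this illustration, and the tag names are the ones from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RemoveNoise {

    // Hypothetical helper: strips every <noise> element, then returns
    // the text left inside the first <source> element.
    static String withoutNoise(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        doc.select("noise").remove();                  // detach all noise elements from the tree
        Element source = doc.select("source").first(); // null if there is no <source>
        return source == null ? "" : source.text();
    }

    public static void main(String[] args) {
        String br = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        // The two remaining text fragments are adjacent, so they are joined without a space.
        System.out.println(withoutNoise(br));
    }
}
```

Because the fragments end up glued together, this variant is mostly useful when you only need the noise gone and do not care where one fragment ends and the next begins.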
Update: I found a simpler way to do this:
Element source = doc.select("source").first(); // select source element
for (TextNode node : source.textNodes()) {
System.out.println(node);
}
It iterates over the text nodes that the source element owns directly and prints each one on a new line. The output is:
foo bar bar
baring foo
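For reference, here is a minimal runnable sketch of the textNodes() approach. The class name DirectText and the helper directTextOf are hypothetical; trimming each fragment and skipping blank ones is an added assumption, meant to cope with input that has line breaks or indentation between the tags.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class DirectText {

    // Hypothetical helper: returns the text fragments owned directly by the
    // first element with the given tag name, ignoring text in nested elements.
    static List<String> directTextOf(String xml, String tagName) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element element = doc.select(tagName).first();
        List<String> fragments = new ArrayList<>();
        if (element == null) {
            return fragments; // no such element, nothing to collect
        }
        for (TextNode node : element.textNodes()) {
            String text = node.text().trim(); // text() normalises whitespace runs
            if (!text.isEmpty()) {            // skip whitespace-only nodes
                fragments.add(text);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String br = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        for (String fragment : directTextOf(br, "source")) {
            System.out.println(fragment);
        }
        // prints:
        // foo bar bar
        // baring foo
    }
}
```

If a single string is wanted instead of one fragment per line, String.join(" ", directTextOf(br, "source")) puts an explicit space between the fragments.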
sentence.ownText() returns the right set of strings, but there are some spacing issues. Is there any way to fix that?
FYI, I updated my answer because I found a simpler way.