Java Jsoup question: how do I split by word?
I want to get the HTML content without the tags, with a result like this:
word
word
word
So I tried the following:
public class PreProcessing {
    public static void main(String[] args) throws Exception {
        PrintWriter out = new PrintWriter("filename.txt");
        URL url = new URL("https://en.wikipedia.org/wiki/Distributed_computing");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine = "";
        String input = "";
        while ((inputLine = in.readLine()) != null)
        {
            input += inputLine;
            // System.out.println(inputLine);
        }
        //create Jsoup document from HTML
        Document jsoupDoc = Jsoup.parse(input);
        //set pretty print to false, so \n is not removed
        jsoupDoc.outputSettings(new OutputSettings().prettyPrint(false));
        //select all <br> tags and append \n after that
        // jsoupDoc.select("br").after("\\n");
        //select all <p> tags and prepend \n before that
        // jsoupDoc.select("p").before("\\n");
        //get the HTML from the document, and retaining original new lines
        String str = jsoupDoc.html().replaceAll(" ", "\n");
        // str.replaceAll("\t", "");
        String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
        strWithNewLines.replaceAll("\t", "\n");
        strWithNewLines.replaceAll("\"", "");
        strWithNewLines.replaceAll(".", "");
        System.out.println(strWithNewLines);
        out.print(strWithNewLines);
    }
}
But I want a result like this:
Distributed
computing
-
Wikipedia
Distributed
computing
From
Wikipedia
the
free
encyclopedia
Jump
to
navigation
Jump
to
search
Distributed
application
redirects
here
For
trustless
applications
see
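I tried
strWithNewLines.replaceAll("\"", "");
strWithNewLines.replaceAll(".", "");
but that didn't work. Why not? I searched Google but couldn't find a solution.

Try this for the last few lines. It will get you closer to the result you want: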
String strWithNewLines = Jsoup.clean ...;
String result = strWithNewLines.replaceAll("\t", "\n")
.replaceAll("\"", "");
//.replaceAll(".", "");
System.out.println(result);
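The problem in your code is that strings are immutable, so String.replaceAll does not change anything in the original string. It returns a new string with the substitutions applied, but you never use that result.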
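The immutability point can be illustrated with a small standalone sketch. The class name `ImmutableStringDemo` and the helper `clean` are made up for this illustration and are not part of the asker's code; only standard `String` methods are used.

```java
public class ImmutableStringDemo {
    // Hypothetical helper: returns a new string with tabs turned into
    // newlines and double quotes removed. The caller must use the return value.
    static String clean(String text) {
        return text.replaceAll("\t", "\n")
                   .replaceAll("\"", "");
    }

    public static void main(String[] args) {
        String original = "a\t\"b\"";

        // The return value is discarded here, so this line has no visible effect.
        original.replaceAll("\t", "\n");
        System.out.println(original.contains("\t")); // prints true

        // Capturing the returned string is what makes the replacement stick.
        String cleaned = clean(original);
        System.out.println(cleaned.contains("\t"));  // prints false
    }
}
```

Chaining the calls, as in the answer's snippet, works because each `replaceAll` operates on the string returned by the previous one.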
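There is also a problem with .replaceAll(".", ""). The first argument is a regular expression, and in a regex . matches every character (except line terminators by default), so every match is replaced with an empty string and you end up with an essentially empty result. To remove literal periods, escape the dot with .replaceAll("\\.", "") or use the non-regex .replace(".", "").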
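The difference between the regex dot and a literal period can be sketched as follows. The class and method names are invented for this example; the calls themselves are standard `String.replaceAll` (regex) and `String.replace` (literal).

```java
public class DotRegexDemo {
    // Unescaped dot: a regex that matches any character except line terminators.
    static String removeWithUnescapedDot(String text) {
        return text.replaceAll(".", "");
    }

    // Escaped dot: matches only a literal period.
    static String removePeriodsRegex(String text) {
        return text.replaceAll("\\.", "");
    }

    // String.replace takes plain text, not a regex, so no escaping is needed.
    static String removePeriodsLiteral(String text) {
        return text.replace(".", "");
    }

    public static void main(String[] args) {
        String text = "redirects here. For trustless applications";
        System.out.println(removeWithUnescapedDot(text).length()); // prints 0
        System.out.println(removePeriodsRegex(text));   // prints redirects here For trustless applications
        System.out.println(removePeriodsLiteral(text)); // prints redirects here For trustless applications
    }
}
```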
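For the one-word-per-line output asked about in the question, one possible approach is to strip the unwanted punctuation and then split on whitespace. The sketch below assumes the tag-free text is already available as a plain string (for example from Jsoup's `Document.text()`); `WordSplitter` and `words` are hypothetical names, and the set of characters removed (period, comma, double quote) is a guess based on the desired output shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSplitter {
    // Removes periods, commas and double quotes, then splits on runs of whitespace.
    static List<String> words(String text) {
        String stripped = text.replaceAll("[.,\"]", "");
        List<String> result = new ArrayList<>();
        for (String token : stripped.split("\\s+")) {
            // split() can yield an empty token for leading whitespace or empty input
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Distributed computing - Wikipedia. From Wikipedia, the free encyclopedia";
        // Print one word per line
        for (String word : words(text)) {
            System.out.println(word);
        }
    }
}
```

Splitting on `\\s+` treats tabs, spaces and newlines alike, so separate replacements of `"\t"` and `" "` are not needed in this variant.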