Java: Storing text in a String using jSoup
I am trying to understand how to use htmlUnit and jSoup together, and I have managed to get the basics working. However, I am trying to store the text from a specific web page in a String, and when I do this it only returns a single line rather than the whole text.
I know the code I have written works, because when I print out p.text() it returns the whole text stored on the website:
private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}
When I introduce a String to store the text from p.text(), it only returns a single line rather than the whole text:
private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text = p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}
In the end, all I want to do is store the whole text in a single String. Any help would be greatly appreciated. Thanks in advance.
Document doc = Jsoup.connect(url).get();
String text = doc.text();
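That is basically it. Since jSoup already strips all HTML tags from the text, you can simply use doc.text(), and you will get the content of the whole page with the HTML tags removed.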
for (Element p : paragraphs)
    text += p.text(); // Append the text.
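In your code you overwrite the value of the variable text on every iteration of the loop. That is why the function returns only the last line.

I think using the HtmlUnit result as a starting point for jSoup is a strange idea. Your approach has several drawbacks (think of cookies, for example). And HtmlUnit has of course already parsed the HTML, so you are doing that work twice.

I hope this code does what you need without jSoup: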
private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}
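The difference between overwriting and appending can be reproduced without any network access. The following is only a minimal sketch: the class name ParagraphJoinDemo, the method names overwrite and append, and the sample strings are made up for illustration, and plain Strings stand in for the text of each paragraph element. It also adds a newline between paragraphs, which the snippets above do not do, so that the paragraphs do not run together in the result.

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphJoinDemo {

    // Same pattern as the loop in the question: each iteration replaces
    // the previous value, so only the last paragraph is returned.
    static String overwrite(List<String> paragraphs) {
        String text = "";
        for (String p : paragraphs) {
            text = p;
        }
        return text;
    }

    // Appends every paragraph to a StringBuilder, inserting a newline
    // between paragraphs (an assumption; use any separator you prefer).
    static String append(List<String> paragraphs) {
        StringBuilder text = new StringBuilder();
        for (String p : paragraphs) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(p);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the text of each <p> element.
        List<String> paragraphs = Arrays.asList(
                "First paragraph.", "Second paragraph.", "Third paragraph.");

        System.out.println(overwrite(paragraphs));
        System.out.println(append(paragraphs));
    }
}
```

In the scraping code, the same idea would apply by appending p.text() (jSoup) or p.asText() (HtmlUnit) inside the loop instead of assigning it.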
Thank you very much! This is the right way to solve my problem!