Java: Storing text in a String using jSoup [java, html, string, jsoup]


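I'm trying to understand how to use HtmlUnit and jSoup together, and I have the basics working. However, when I try to store the text from a particular web page in a String, I get only one line back instead of the whole text.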

I know the code I wrote works: when I print p.text() for each paragraph, I get the entire text of the page.

private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}

When I introduce a String to store the text from p.text(), it returns only one line instead of the whole text:

private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text=p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}
Ultimately, all I want to do is store the whole text in a single String. Any help would be appreciated. Thanks in advance.

Document doc = Jsoup.connect(url).get();
String text = doc.text();
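That's basically it. Since jSoup already strips all HTML tags from the text, you can call doc.text() and you will get the content of the whole page with the HTML tags removed.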

    for (Element p : paragraphs)
        text+=p.text(); // Append the text.

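In your code, you overwrite the value of the variable text on every loop iteration. That is why the function returns only the last line.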
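
To make the difference concrete, here is a minimal sketch in plain Java that needs neither HtmlUnit nor jSoup. The class name ParagraphJoin, the two method names, and the sample strings are hypothetical; the list simply stands in for the values that p.text() would return. The newline separator is also an assumption on my part, since the one-line fix above appends the paragraphs with nothing between them.

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphJoin {

    // Mirrors the loop from the question: every assignment replaces the
    // previous value, so only the last paragraph is left at the end.
    static String overwrite(List<String> paragraphs) {
        String text = "";
        for (String p : paragraphs) {
            text = p;
        }
        return text;
    }

    // Accumulates every paragraph, putting a newline between them
    // (the separator is an assumption, not part of the original fix).
    static String joinAll(List<String> paragraphs) {
        StringBuilder text = new StringBuilder();
        for (String p : paragraphs) {
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(p);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the strings p.text() would return.
        List<String> paragraphs = Arrays.asList(
                "First paragraph.", "Second paragraph.", "Third paragraph.");

        System.out.println("Overwriting: " + overwrite(paragraphs));
        System.out.println("Appending:");
        System.out.println(joinAll(paragraphs));
    }
}
```

A StringBuilder is used here rather than += mainly because repeated String concatenation in a loop creates a new String on every pass; for a handful of paragraphs either approach should behave the same.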

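I think using the HtmlUnit result as the starting point for jSoup is a strange idea. Your approach has a number of drawbacks (think about cookies, for example). And of course HtmlUnit has already parsed the HTML, so you are doing that work twice.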

I hope this code does what you need without jSoup:

private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}

Thank you very much! This was the right way to solve my problem!