Page content is loaded with JavaScript and Jsoup doesn't see it

javascript java html parsing

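One block on the page is filled with content by JavaScript, and after loading the page with Jsoup that block is empty. Is there a way to also get the JavaScript-generated content when parsing the page with Jsoup?

I can't paste the page code here because it is too long.

I need to get this information in Java, preferably with Jsoup. This is the element I need; its content is filled in by JavaScript: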

<div id="tags_list">
    <a href="/tagsc0t20099.html" style="font-size:14;">разведчик</a>
    <a href="/tagsc0t1879.html" style="font-size:14;">Sr</a>
    <a href="/tagsc0t3140.html" style="font-size:14;">стратегический</a>
</div>

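JSoup is an HTML parser, not some kind of embedded browser engine. This means it is completely unaware of any content that JavaScript adds to the DOM after the initial page load.

To access that kind of content you need an embedded browser component; there are many existing discussions about components of that kind.

Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?

I would think not, considering how hard that would be without building a complete JavaScript interpreter in Java.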

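Actually, there is a "way"! Maybe it is more of a "workaround" than a "way". The code below checks for a meta "refresh" attribute and for a JavaScript redirect. If either one is present, the RedirectedUrl variable is set, so you know your target. You can then retrieve the target page and continue.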

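    // "page" is the org.jsoup.nodes.Document that was already loaded
    String RedirectedUrl = null;
    // 1) <meta http-equiv="refresh" content="0; url=...">
    for (Element m : page.select("html head meta")) {
        if (m.attr("http-equiv").equalsIgnoreCase("refresh")) {
            String content = m.attr("content");
            int eq = content.indexOf('=');
            // Take everything after the first '=' so URLs with query strings survive
            if (eq >= 0) {
                RedirectedUrl = content.substring(eq + 1).trim();
            }
            break;
        }
    }
    // 2) <script>window.location.href = '...';</script>
    if (RedirectedUrl == null) {
        for (Element script : page.select("script")) {
            String s = script.data().trim();
            if (s.startsWith("window.location.href")) {
                int start = s.indexOf("=");
                int end = s.indexOf(";");
                if (start > 0 && end > start) {
                    s = s.substring(start + 1, end);
                    RedirectedUrl = s.replace("'", "").replace("\"", "").trim();
                    break;
                }
            }
        }
    }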

... now retrieve the redirected page again...
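
The string handling in the workaround above can also be done with regular expressions, which copes with quoted URLs and with URLs that themselves contain `=`. This is only a sketch using the JDK; the class name `RedirectTargets` and both method names are made up for illustration, and the patterns only cover the simple `url=...` and `window.location.href = '...'` forms. In practice the inputs would come from `meta.attr("content")` and `script.data()`.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts a redirect target from raw attribute or script text. */
final class RedirectTargets {

    // The "url=..." part of a meta refresh content value, e.g. "0; URL=/next?a=1"
    private static final Pattern META_REFRESH =
            Pattern.compile("(?i)\\burl\\s*=\\s*['\"]?([^'\"]+)");

    // window.location.href = '...'  or  window.location.href = "..."
    private static final Pattern JS_REDIRECT =
            Pattern.compile("window\\.location\\.href\\s*=\\s*['\"]([^'\"]+)['\"]");

    /** Returns the URL named in a meta refresh content value, if there is one. */
    static Optional<String> fromMetaContent(String content) {
        Matcher m = META_REFRESH.matcher(content);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    /** Returns the URL assigned to window.location.href in a script body, if there is one. */
    static Optional<String> fromScript(String script) {
        Matcher m = JS_REDIRECT.matcher(script);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromMetaContent("0; URL=/next?page=2").orElse("none"));
        System.out.println(fromScript("window.location.href = '/login?next=%2Fhome';").orElse("none"));
    }
}
```

A relative result such as `/next?page=2` still has to be resolved against the URL of the page it came from before it can be fetched.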

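You need to understand what is happening:

When you request a page from a website, whether with Jsoup or with your browser, what comes back is some HTML. Jsoup is able to parse that.
However, most websites include JavaScript in that HTML, or link to it from that HTML, and that script fills the page with content. Your browser can execute JavaScript and so populate the page. Jsoup cannot.

The way to think about it is this: parsing HTML is simple. Executing JavaScript and updating the corresponding HTML is far more complex, and that is the browser's job.

Here are some solutions to this kind of problem:

If you can find out which Ajax calls the JavaScript code makes to load the content, you may be able to use the URLs of those calls in Jsoup. Use the developer tools in your browser to find them. This is not guaranteed to work, though:

the URL may be dynamic and depend on what is on the page at that moment
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough

In these cases you need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know and recommend works with JavaScript, and you launch it from Java by starting a new process. If you would rather stick with Java, there are Java alternatives as well.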
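
For the first option above (calling the Ajax URL directly), a request that carries the session cookies might be put together roughly like this. It is a sketch under assumptions: the class `AjaxReplay`, the endpoint URL and the cookie names are all hypothetical, and it builds the request with the JDK's `java.net.http` API (Java 11+) instead of Jsoup. The `X-Requested-With` header is included because some backends check it; whether a given site needs it is something to confirm in the browser's network tab.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for replaying an Ajax call observed in the browser's developer tools. */
final class AjaxReplay {

    /** Joins cookies into one "Cookie" header value: name1=value1; name2=value2 */
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    /** Builds a GET request resembling the one the page's script would send. */
    static HttpRequest buildRequest(String ajaxUrl, Map<String, String> cookies) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(ajaxUrl))
                .header("X-Requested-With", "XMLHttpRequest") // assumption: the backend may check this
                .GET();
        if (!cookies.isEmpty()) {
            builder.header("Cookie", cookieHeader(cookies));
        }
        return builder.build();
    }
}
```

The request can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. If the response body is an HTML fragment, it can be handed to `Jsoup.parse` as usual; if it is JSON, it needs a JSON parser instead.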

I solved my problem with com.codeborne.phantomjsdriver. Note: this is Groovy code.

pom.xml

<dependency>
    <groupId>com.codeborne</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version><!-- put the latest released version here --></version>
</dependency>

ClassInProject.groovy

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

class PhantomJsUtils {
    // Directory for temporary HTML files
    private static String tmpDir = 'data/temp/';

    // Loads the given URL in PhantomJS, lets its scripts run, and parses the resulting DOM with Jsoup
    public static Document renderPage(String url) {
        System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
        WebDriver ghostDriver = new PhantomJSDriver();
        try {
            ghostDriver.get(url);
            return Jsoup.parse(ghostDriver.getPageSource());
        } finally {
            ghostDriver.quit();
        }
    }

    // Writes an already-fetched document to a temp file and renders it from a file: URL
    public static Document renderPage(Document doc) {
        File tmpFile = new File("$tmpDir${Calendar.getInstance().timeInMillis}.html");
        tmpFile.parentFile.mkdirs();
        tmpFile.text = doc.toString();
        return renderPage(tmpFile.toURI().toString());
    }
}

Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))

Try:

Document doc = Jsoup.connect(url)
        .header("Accept-Encoding", "gzip, deflate")
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
        .maxBodySize(0)
        .timeout(600000)
        .get();

Once I specified the user agent, my problem was solved.

You can combine JSoup with another framework that interprets the web page. In this example I use HtmlUnit.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

...

WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);

Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");

Use a combination of JSoup and HtmlUnit to get the page content after the JavaScript has been loaded.

pom.xml

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <!-- 2.x line, which matches the com.gargoylesoftware.htmlunit imports used below -->
    <version>2.70.0</version>
</dependency>

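(Incidentally, this problem already existed when this answer was written, and technically for years before that. The problem isn't JavaScript in Java, it is an embeddable browser in Java; JS is only one piece of the puzzle.)

@DaveNewton JSoup already contains the other pieces of the puzzle (a DOM implementation, a request mechanism). Of course, it would be much easier to drop JSoup and use an embeddable browser that combines these features (DOM, requests, JavaScript interpreter).

1. Android app case: I'm not sure it would help, and I have never run into the same behavior in an Android view. 2. Android web case: the problem with the PhantomJS driver solution is that you don't run a UI driver. In fact, PhantomDriver is "a headless WebKit scriptable with a JavaScript API". It is a UI-less driver, and I usually use it only to collect data. You cannot verify that the UI view is correct.

Are there any other libs for Android that can get page content loaded with JavaScript?

A complex example: load the login page, get the session and CSRF token, then POST and wait (15 seconds) for the main page to finish loading.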

import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();

String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");

HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");

WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
            new URL("https://url"),
            "https://referrer");

// Add other cookies/ Session ...

webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.getOptions().setThrowExceptionOnScriptError(false);

URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);

requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);

// Wait up to 15 seconds for background JavaScript started by the page to finish.
// This has to come after getPage(); before any page is loaded there is nothing to wait for.
webClient.waitForBackgroundJavaScript(15000);

// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());