Java: Running HtmlUnit under Tomcat 7

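I am trying to use HtmlUnit to generate crawlable HTML snapshots of AJAX pages (as suggested). The idea is to build functionality that lets the business create snapshots either through a regularly scheduled service or on demand.

I wrote a quick proof-of-concept main class to test the theory, and it works as expected: when we view the source, we can see all the data the Google crawler needs, which was not visible before. I am now integrating it into our application, which runs on Tomcat 7, and I am hitting a problem when jquery.js is downloaded from Google, with the following log message: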

2013-03-15 18:10:38,071 ERROR [author->taskExecutor-1] com.gargoylesoftware.htmlunit.html.HtmlPage       : Error loading JavaScript from [https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.js].
javax.net.ssl.SSLException: hostname in certificate didn't match: <ajax.googleapis.com/173.194.67.95> != <*.googleapis.com> OR <*.googleapis.com> OR <googleapis.com>
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:228)
at org.apache.http.conn.ssl.BrowserCompatHostnameVerifier.verify(BrowserCompatHostnameVerifier.java:54)
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:149)
at org.apache.http.conn.ssl.AbstractVerifier.verify(AbstractVerifier.java:130)
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:397)
at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:495)
at org.apache.http.conn.scheme.SchemeSocketFactoryAdaptor.connectSocket(SchemeSocketFactoryAdaptor.java:62)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)

...
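
One detail in the log above may be worth a closer look: the name being compared is `ajax.googleapis.com/173.194.67.95` (host plus IP address) rather than the bare host name. The sketch below is a deliberately simplified wildcard check, written only to illustrate that observation. The class and method names are made up, and this is not HttpClient's actual `AbstractVerifier` logic. It suggests that the bare host would match `*.googleapis.com`, while the host-plus-IP string cannot. Why the name arrives in that form is not established here.

```java
import java.util.Locale;

/** Illustration only: a simplified wildcard check, not the HttpClient implementation. */
final class HostnameMatchDemo {

    /** Returns true if host matches a certificate name such as "*.googleapis.com". */
    static boolean matches(String host, String certName) {
        String h = host.toLowerCase(Locale.ROOT);
        String c = certName.toLowerCase(Locale.ROOT);
        if (c.startsWith("*.")) {
            // "*.example.com" covers exactly one extra label, e.g. "ajax.example.com".
            String suffix = c.substring(1); // ".example.com"
            if (!h.endsWith(suffix)) {
                return false;
            }
            String label = h.substring(0, h.length() - suffix.length());
            return !label.isEmpty() && label.indexOf('.') < 0;
        }
        return h.equals(c);
    }

    public static void main(String[] args) {
        // Bare host name, as one would expect the verifier to receive it.
        System.out.println(matches("ajax.googleapis.com", "*.googleapis.com"));
        // Host plus IP, as it appears in the log message above.
        System.out.println(matches("ajax.googleapis.com/173.194.67.95", "*.googleapis.com"));
    }
}
```
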
Here is my integrated version:

    /* Entry point for the generation */
    public void generate() {

        log.info("Beginning snapshot generation...");

        try {

            // Get the URLS
            log.info("Retrieving list of page urls");
            List<String> pageUrls = getUrlList();
            log.info("Found {} urls to generate", pageUrls.size());

            // For every url we have generate a snapshot
            for (String pageUrl: pageUrls) {
                takeSnapshot(pageUrl);
            }
            log.info("Finished generating snapshots!");
        } catch (Exception e) {
            log.error("Exception caught while generating snapshot", e);
        }
    }

    /**
     * Take the HTML snapshot of the url and output to the snapshot directory
     */
    private void takeSnapshot(String pagePath) {
        try {
            String fullOutputFilePath = config.getHtmlSnapshotDirectory() + File.separator
                                                        + pagePath + File.separator + HTML_SNAPSHOT_FILE_NAME;
            String pageUrl = "http://myurl.com" + pagePath;

            log.debug("Instantiating Web Client...");
            final WebClient webClient = new WebClient();
            log.debug("Client instantiated");
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setPrintContentOnFailingStatusCode(false);
            final HtmlPage page = (HtmlPage)webClient.getPage(pageUrl);

            webClient.waitForBackgroundJavaScript(1500);

            snapshotFile = new File(fullOutputFilePath);
            FileUtils.touch(snapshotFile);

            writer = new OutputStreamWriter(new FileOutputStream(snapshotFile), "UTF-8");
            writer.write(page.asXml());
            writer.flush();
        } catch (MalformedURLException mue) {
            System.out.println("MalformedURL exception");
        } catch (IOException ioe) {
            System.out.println("IOException occurred " +  ioe.getMessage());
        } finally {
            IOUtils.closeQuietly(writer);
        }
    }
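
For the file-writing half of `takeSnapshot`, here is a minimal sketch using only the JDK, assuming Java 7 or later. `SnapshotWriter`, its `write` method, and the file name `snapshot.html` in the usage note are hypothetical names, not part of the code above. It builds the same `<snapshot dir>/<page path>/<file name>` layout, writes UTF-8, and lets try-with-resources close the writer, so an `IOException` reaches the caller instead of going to `System.out`.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes one HTML snapshot below a snapshot directory. */
final class SnapshotWriter {

    /** Writes html to snapshotDir/pagePath/fileName, creating directories as needed. */
    static Path write(Path snapshotDir, String pagePath, String fileName, String html)
            throws IOException {
        // Strip leading slashes so "/about/team" resolves inside snapshotDir.
        String relative = pagePath.replaceAll("^/+", "");
        Path target = snapshotDir.resolve(relative).resolve(fileName).normalize();
        // Refuse page paths such as "../../etc" that would leave the snapshot directory.
        if (!target.startsWith(snapshotDir.normalize())) {
            throw new IOException("Path escapes snapshot directory: " + pagePath);
        }
        Files.createDirectories(target.getParent());
        // try-with-resources flushes and closes the writer even if write() fails.
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(html);
        }
        return target;
    }
}
```

With this in place, the body of `takeSnapshot` could call something like `SnapshotWriter.write(Paths.get(config.getHtmlSnapshotDirectory()), pagePath, "snapshot.html", page.asXml())` and log any `IOException` through `log.error`.
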
Maven dependencies:

        <dependency>
            <groupId>net.sourceforge.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>2.12</version>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.2.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.3-alpha1</version>
        </dependency>

Thanks, everyone.

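So adding
webClient.getOptions().setUseInsecureSSL(true)
was the key to resolving this. However, I had to use the deprecated version:
webClient.setUseInsecureSSL(true)


I do not know why the newer call fails to work when running inside Tomcat, but the deprecated one fixed the problem. If anyone can shed light on why, that would be great. I also do not yet understand why setting a BrowserVersion causes the application to stall when running under Tomcat. I have asked the HtmlUnit mailing list about both questions.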

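I found that I can use
webClient.getOptions().setUseInsecureSSL(true)
to try to get around the SSL problem. However, when I put this line above the other
.getOptions().set…
statements, the code hangs on that line (just as it does when a browser version is specified). That means I am still stuck. Any help would be greatly appreciated.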
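
One thing that might be worth ruling out, offered purely as a guess and not as a confirmed diagnosis: the log shows the work running on a `taskExecutor-1` thread, and `generate()` only catches `Exception`. If a call fails with a `java.lang.Error` (for example a linkage error such as `NoSuchMethodError`), that catch block never runs and the task can end without any log output, which may look like a hang. The sketch below uses only the JDK. `SilentErrorDemo` and its methods are hypothetical names, and the thrown error is simulated.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustration only: how an Error on a worker thread slips past catch (Exception). */
final class SilentErrorDemo {

    /** A task shaped like generate(): it catches Exception but not Error. */
    static Runnable snapshotLikeTask() {
        return () -> {
            try {
                // Simulated stand-in for a library call that fails at link time.
                throw new NoSuchMethodError("simulated");
            } catch (Exception e) {
                // Never reached: NoSuchMethodError is an Error, not an Exception.
                System.out.println("caught " + e);
            }
        };
    }

    /** Runs the task on a worker thread and returns the Throwable it died with, or null. */
    static Throwable runAndCapture(Runnable task) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(task);
            try {
                future.get();
                return null;
            } catch (ExecutionException e) {
                // The Future keeps whatever the task threw, including Errors.
                return e.getCause();
            }
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Task died with: " + runAndCapture(snapshotLikeTask()));
    }
}
```

Temporarily catching `Throwable` around the `WebClient` calls and logging it with `log.error` would show whether something like this is happening under Tomcat.
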