Java: downloading an entire web page
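An entire web page can be downloaded with HTMLEditorKit. However, I need to download a page that only loads its full content as you scroll down. This behaviour is usually implemented with JavaScript combined with Ajax.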
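Q.1: Is there a way to trick the target web page into delivering all of its content, using only Java?
Q.2: If this is not possible in Java alone, is it possible in combination with JavaScript?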
For reference, this is what I have written so far:
You can do this with IDM's grabber; that should help.
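Yes, you can download a web page to local disk with Java code. You cannot download the static HTML content with JavaScript, because JavaScript does not offer the file-creation facilities that Java does.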
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpDownloadUtility {
    private static final int BUFFER_SIZE = 4096;

    /**
     * Downloads a file from a URL
     * @param fileURL HTTP URL of the file to be downloaded
     * @param saveDir path of the directory to save the file
     * @throws IOException
     */
    public static void downloadFile(String fileURL, String saveDir)
            throws IOException {
        URL url = new URL(fileURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        try {
            int responseCode = httpConn.getResponseCode();
            // always check HTTP response code first
            if (responseCode != HttpURLConnection.HTTP_OK) {
                System.out.println("No file to download. Server replied HTTP code: " + responseCode);
                return;
            }
            String fileName = "";
            String disposition = httpConn.getHeaderField("Content-Disposition");
            String contentType = httpConn.getContentType();
            int contentLength = httpConn.getContentLength();
            if (disposition != null) {
                // extracts file name from header field, quoted or unquoted
                int index = disposition.indexOf("filename=");
                if (index >= 0) {
                    fileName = disposition.substring(index + "filename=".length())
                            .replace("\"", "").trim();
                }
            }
            if (fileName.isEmpty()) {
                // extracts file name from URL (last path segment)
                fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1);
            }
            if (fileName.isEmpty()) {
                // URL ends with "/": fall back to a default name
                fileName = "index.html";
            }
            System.out.println("Content-Type = " + contentType);
            System.out.println("Content-Disposition = " + disposition);
            System.out.println("Content-Length = " + contentLength);
            System.out.println("fileName = " + fileName);
            String saveFilePath = saveDir + File.separator + fileName;
            // try-with-resources closes both streams even if copying fails
            try (InputStream inputStream = httpConn.getInputStream();
                 FileOutputStream outputStream = new FileOutputStream(saveFilePath)) {
                int bytesRead;
                byte[] buffer = new byte[BUFFER_SIZE];
                while ((bytesRead = inputStream.read(buffer)) != -1) {
                    outputStream.write(buffer, 0, bytesRead);
                }
            }
            System.out.println("File downloaded");
        } finally {
            httpConn.disconnect();
        }
    }
}
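Note that a utility like the one above saves only the bytes served at the given URL; the images a page shows are fetched by separate requests. As a rough illustration, here is a minimal sketch that collects the absolute URLs of `<img>` tags from already-downloaded HTML so they could each be fetched in turn. The class name `ResourceLinkExtractor`, the method `extractImageUrls`, and the example URL are all hypothetical, and the regular expression is a crude assumption rather than a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResourceLinkExtractor {
    // Crude match for the src attribute of an <img> tag (assumption: quoted values only).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns the image URLs found in html, resolved against baseUrl. */
    public static List<String> extractImageUrls(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            try {
                // resolve() turns relative paths into absolute URLs
                urls.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // skip attribute values that are not valid URIs
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img src=\"/img/a.jpg\"><img alt='x' src='b.gif'>";
        for (String u : extractImageUrls(html, "https://example.com/page/index.html")) {
            System.out.println(u);
        }
    }
}
```

Images that a page only adds after scrolling will not appear in the initial HTML, so a scan like this cannot find them.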
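You can do this with the Selenium WebDriver Java classes.
WebDriver is normally used for testing, but it can simulate a user scrolling down the page until the page stops changing; you can then use Java code to save the content to a file.
Use the HtmlUnit library to get all the text and the image/CSS files: HtmlUnit [link] HTMLUnit.sourceforge.net
1) To download the text content, use the code at the links below: all text content [link]; specific tags such as span [link]
2) To get images/files, use the following [link]
Can you give an example of such a site/page?
Does my question make sense to you? I am really busy right now, but I will come back to this topic as soon as possible (within 7 hours). Once I have studied the solution you proposed, your help will be rewarded. Thank you for your understanding.
Great, it worked. However, I tested it on 9gag.com and it did not download the full content. If you scroll on 9gag for about 30 seconds you reach the bottom of the page. Before that point there are many images, and their URLs ending in .jpg or .gif are not in the file downloaded by the code. I guess your approach may be the only one offered here... If no more effective code turns up, the bounty is yours. Thanks.
There is software that can download a whole page's CSS, JS, images and fonts. However, if you use a Java program, you can only download the content served at the URL (here, only the HTML code).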
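The "scroll until the page stops changing" idea from the Selenium WebDriver answer above can be sketched roughly as follows. To keep the example runnable without a browser, the loop is written against two small callbacks (one that reports the page height, one that scrolls to the bottom) and is exercised with a fake page. The class name `ScrollUntilStable`, the parameter names and the fake page are all illustrative assumptions, not part of any answer in this thread.

```java
import java.util.function.LongSupplier;

public class ScrollUntilStable {
    /**
     * Scrolls, waits, and re-measures the page until the height no longer
     * changes or maxRounds is reached. Returns the number of scrolls performed.
     */
    public static int scrollUntilStable(LongSupplier pageHeight, Runnable scrollToBottom,
                                        long pauseMillis, int maxRounds) {
        long lastHeight = pageHeight.getAsLong();
        int rounds = 0;
        while (rounds < maxRounds) {
            scrollToBottom.run();
            rounds++;
            try {
                Thread.sleep(pauseMillis); // give Ajax requests time to append content
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            long newHeight = pageHeight.getAsLong();
            if (newHeight == lastHeight) {
                break; // nothing new was loaded: assume the bottom was reached
            }
            lastHeight = newHeight;
        }
        return rounds;
    }

    public static void main(String[] args) {
        // Fake page standing in for a browser: grows by 1000px on each of the first 3 scrolls.
        long[] height = {1000};
        int[] loadsLeft = {3};
        int rounds = scrollUntilStable(
                () -> height[0],
                () -> { if (loadsLeft[0] > 0) { height[0] += 1000; loadsLeft[0]--; } },
                0, 50);
        System.out.println("scrolls: " + rounds + ", final height: " + height[0]);
    }
}
```

In a real run, assuming Selenium WebDriver is on the classpath, `pageHeight` could wrap `((Number) js.executeScript("return document.body.scrollHeight")).longValue()` and `scrollToBottom` could wrap `js.executeScript("window.scrollTo(0, document.body.scrollHeight)")`, where `js` is the driver cast to `JavascriptExecutor`; `driver.getPageSource()` would then return the rendered HTML to write to a file. A page that keeps loading content indefinitely will only stop at `maxRounds`.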