Java 专门用于下载图像和文件的网络爬虫
我正在为我的一门课做作业Java 专门用于下载图像和文件的网络爬虫,java,html-parsing,jsoup,web-crawler,Java,Html Parsing,Jsoup,Web Crawler,我正在为我的一门课做作业 HttpConnection mimeConn =null; Response mimeResponse = null; for(Element link: links){ String linkurl =link.absUrl("href"); if(!linkurl.contains("#")){ if(DownloadRepository.curlExists(link.absUr
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
我应该写一个webcrawler,从给定特定爬网深度的网站下载文件和图像
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
我被允许使用第三方解析api,因此我使用的是Jsoup。我还尝试了HTMLPasser。两款软件都很不错,但都不完美
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
在处理url之前,我使用了默认java URLConnection来检查内容类型,但随着链接数量的增加,它会变得非常缓慢
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
问题:有人知道图像和链接的专用解析器api吗?
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
我可以开始用Jsoup写我的,但我很懒。此外,如果有一个可行的解决方案,为什么要重新发明轮子呢?任何帮助都将不胜感激
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
我需要在循环链接时检查contentType,以有效地检查链接是否指向文件,但Jsoup没有我需要的内容。以下是我所拥有的:
**
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
更新
根据Yoshi的回答,我能够让我的代码正常工作。以下是链接:
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
使用我认为这个API对于您的目的来说已经足够好了。你也可以在这个网站上找到好的食谱
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
几个步骤:
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
例如
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
您只能使用一行代码来获取DOM对象:
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
而不是此代码:
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
更新1
尝试在下一行添加代码:
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
如果你在sameJava开发中的懒散检查
wget
很大程度上是为了研究,为给定的问题域找到最好的API,并使用它来解决你的问题。当然,要懒惰,不要重新发明轮子,但不要懒得不做自己的研究。我需要有效地检查contentType,但Jsoup没有我需要的。