java.net.MalformedURLException: no protocol: /intl/en/policies/ on a GET request


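I've been working on a simple program that goes through all the links on a page, visits them, and then recurses. But it seems to stop immediately at runtime with this error: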

java.net.MalformedURLException: no protocol: /intl/en/policies/
at java.net.URL.<init>(Unknown Source)
at java.net.URL.<init>(Unknown Source)
at java.net.URL.<init>(Unknown Source)
at me.dylan.WebCrawler.WebC.sendGetRequest(WebC.java:67)
at me.dylan.WebCrawler.WebC.<init>(WebC.java:27)
at me.dylan.WebCrawler.WebC.main(WebC.java:36)


My code:

package me.dylan.WebCrawler;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public class WebC {
//  FileUtil f;
    int linkamount=0;
    ArrayList<URL> visited = new ArrayList<URL>();
    ArrayList<String> urls = new ArrayList<String>();
    public WebC() {

        try {
//          f= new FileUtil();
            sendGetRequest("http://www.google.com");
        } catch (IOException e) {
            e.printStackTrace();
        }
        catch (BadLocationException e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) {
        new WebC();
    }
    public void sendGetRequest(String path) throws IOException, BadLocationException, MalformedURLException {

        URL url = new URL(path);
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Content-Language", "en-US");
         BufferedReader rd = new BufferedReader(new InputStreamReader(con.getInputStream()));
         EditorKit kit = new HTMLEditorKit();
         HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
         doc.putProperty("IgnoreCharsetDirective", new Boolean(true));
         kit.read(rd, doc, 0);

         //Get all <a> tags (hyperlinks)
         HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
         while (it.isValid())
         {
             MutableAttributeSet mas = (MutableAttributeSet)it.getAttributes();
             //get the HREF attribute value in the <a> tag
             String link = (String)mas.getAttribute(HTML.Attribute.HREF);
             if(link!=null && link!="") {
                 urls.add(link);
             }

             it.next();
         }
         for(int i=urls.size()-1;i>=0;i--) {
             if(urls.get(i)!=null) {
                if(/*f.searchforString(urls.get(i)) ||*/ visited.contains(new URL(urls.get(i)))) {
                    urls.remove(i);
                    continue;
                } else {
                    System.out.println(linkamount++);
                    System.out.println(path);
                    visited.add(new URL(path));
                    //f.write(urls.get(i));
                    sendGetRequest(urls.get(i));
                }
                 try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
             }
         }           
    }
}


Honestly, I have no idea how to fix this. Apparently Google has an href that is not a valid URL on its own. How do I deal with that?

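You have to prepend the base URL to the URL part. The URL object expects an absolute form that includes the protocol, whereas the href taken from the page is in relative form. The simple fix is to prepend the base URL to every HREF you get before making the call.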
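
As a minimal sketch of that idea, the two-argument java.net.URL constructor can resolve an href against the page it was found on instead of concatenating strings by hand. The class name LinkResolver and the method resolve are hypothetical names chosen for illustration; they are not part of the question's code.

```java
import java.net.MalformedURLException;
import java.net.URL;

class LinkResolver {
    // Hypothetical helper: turns an href found on pageUrl into an absolute URL string.
    static String resolve(String pageUrl, String href) throws MalformedURLException {
        URL base = new URL(pageUrl);        // the page that contained the link
        URL absolute = new URL(base, href); // resolves href relative to base
        return absolute.toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // prints "http://www.google.com/intl/en/policies/"
        System.out.println(resolve("http://www.google.com", "/intl/en/policies/"));
    }
}
```

In the question's code this would roughly correspond to passing the resolved string to sendGetRequest rather than the raw href.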

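A quick fix is to append urls.get(i) to requestPath (presumably the path argument in the question's code) before the call. That gives it a protocol and a domain to work with. The only catch is that unless you check each URL in the loop for an existing protocol and domain, you can end up with something like this: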

http://www.google.com/http://www.yahoo.com/policies
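
One way to avoid that doubled URL, sketched here under the assumption that every href is syntactically valid, is to check whether the href already carries a scheme before joining it to the current page. HrefJoiner and join are hypothetical names used only for this example.

```java
import java.net.URI;

class HrefJoiner {
    // Hypothetical helper: returns href unchanged if it is already absolute,
    // otherwise resolves it against the page currently being crawled.
    // URI.create throws IllegalArgumentException if href is not a valid URI.
    static String join(String currentPage, String href) {
        URI link = URI.create(href);
        if (link.isAbsolute()) {
            return link.toString(); // already has a scheme such as http
        }
        return URI.create(currentPage).resolve(link).toString();
    }

    public static void main(String[] args) {
        // prints "http://www.yahoo.com/policies"
        System.out.println(join("http://www.google.com/", "http://www.yahoo.com/policies"));
        // prints "http://www.google.com/intl/en/policies/"
        System.out.println(join("http://www.google.com/", "/intl/en/policies/"));
    }
}
```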

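Another problem came up. The code is: ... and I got: ... The problem is the "." at the end of www.google.com. You want to check for edge cases like this before opening the connection. In this case, you want to check whether the HREF contains "" before appending it to the link.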
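
The general advice about checking edge cases before opening a connection could be sketched as a small filter like the one below. The particular checks (empty hrefs, fragment-only links, non-HTTP protocols) are assumptions about what a crawler would typically want to skip, not a list taken from the answer, and LinkFilter and toCrawlable are hypothetical names.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Optional;

class LinkFilter {
    // Hypothetical helper: returns the absolute URL to visit,
    // or Optional.empty() if the href should be skipped.
    static Optional<URL> toCrawlable(String pageUrl, String href) {
        if (href == null) {
            return Optional.empty();
        }
        String trimmed = href.trim();
        // Skip empty links and links that only point within the same page.
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return Optional.empty();
        }
        try {
            URL resolved = new URL(new URL(pageUrl), trimmed);
            String protocol = resolved.getProtocol();
            // Only follow links an HTTP connection can actually open.
            if (protocol.equals("http") || protocol.equals("https")) {
                return Optional.of(resolved);
            }
            return Optional.empty();
        } catch (MalformedURLException e) {
            // Unparseable or unsupported link: skip it instead of aborting the crawl.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String page = "http://www.google.com";
        String[] hrefs = {"/intl/en/policies/", "", "#top", "mailto:someone@example.com"};
        for (String href : hrefs) {
            System.out.println("'" + href + "' -> " + toCrawlable(page, href));
        }
    }
}
```

Note that the emptiness check uses isEmpty() rather than comparing against "" with !=, since != on strings compares object references rather than contents.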