java.lang.IllegalArgumentException thrown by the HttpURLConnection getHeaderFieldKey() method, host = null


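I am trying to build a simple link checker application. I extract all the href attributes from a web page and write them to a file. I then check the extracted content against a regular expression to find valid URLs, and write the valid URLs to a second file. Finally, I visit those URLs and write any broken links to a third file.

In the abridged code below, assume the hrefs have already been extracted and listed in page_contents.txt. Here are the contents of that text file: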

http://computing.dcu.ie/~humphrys/
http://computing.dcu.ie/~humphrys/
http://computing.dcu.ie/~humphrys/blog.html
http://computing.dcu.ie/~humphrys/teaching.html
http://computing.dcu.ie/~humphrys/research.html
http://computing.dcu.ie/~humphrys/contact.html
http://computing.dcu.ie/~humphrys/
http://computing.dcu.ie/~humphrys/ca249/
http://computing.dcu.ie/~humphrys/ca318/
http://computing.dcu.ie/~humphrys/ca425/
http://computing.dcu.ie/~humphrys/ca651/
http://w2mind.computing.dcu.ie/
http://w2mind.org/
index.html
computers.internet.html
#world
#ireland
#uk
#multimedia
#internet
http://www.pressreader.com/
http://www.pressdisplay.com/
http://www.newspaperdirect.com/
http://www.newseum.org/todaysfrontpages/
http://news.google.com/
http://news.google.com/news?ned=uk
http://news.google.com/news?ned=en_ie
http://www.google.com/alerts
http://en.wikinews.org/
http://news.yahoo.com/
http://uk.news.yahoo.com/
http://www.apimages.com/
http://en.wikipedia.org/wiki/Next_Media_Animation
http://www.youtube.com/user/NMAWorldEdition
http://www.youtube.com/user/NMANews
http://www.time.com/
http://www.newsweek.com/
http://www.economist.com/
http://www.salon.com/
http://www.tnr.com/
http://thenewrepublic.com/
http://www.nytimes.com/
http://www.nypost.com/
http://www.washingtonpost.com/
http://www.latimes.com/
http://www.wsj.com/
http://www.jpost.com/
http://www.smh.com.au/
http://www.theonion.com/
http://www.theonion.com/content/video
http://www.youtube.com/user/TheOnion
http://www.theonion.com/content/radionews
http://www.thedailymash.co.uk/
http://themire.net/
http://waterfordwhispersnews.com/
http://www.evilgerald.com/
http://www.langerland.com/
http://www.portadownnews.com/
http://www.portadownnews.com/archive.htm
http://www.irishurls.com/
http://www.irishtimes.com/
http://www.irish-times.com/
http://www.ireland.com/
http://notices.irishtimes.com/
http://www.irishtimes.com/search/
http://www.independent.ie/
http://www.unison.ie/irish_independent/
http://www.independent.ie/search/index.jsp
http://www.announcement.ie/
http://www.iannounce.co.uk/Republic-of-Ireland/52
http://www.sbpost.ie/
http://www.thepost.ie/
http://archives.tcm.ie/businesspost/
http://en.wikipedia.org/wiki/Sunday_Tribune
http://www.irishexaminer.com/
http://www.examiner.ie/
http://www.magill.ie/
http://www.villagemagazine.ie/
http://www.phoenix-magazine.com/
http://www.hotpress.com/
http://www.emigrant.ie/
http://groups.google.com/groups/dir?sel=gtype%3D0%2Cusenet%3Die&
http://www.listenlive.eu/ireland.html
http://www.rte.ie/
http://www.rte.ie/player/
http://www.rte.ie/tv/
http://www.rte.ie/news/
http://www.rte.ie/aertel/170-01.html
http://www.rte.ie/radio/
http://www.rte.ie/radio1/
http://www.rte.ie/smiltest/radio_new.smil
http://www.rte.ie/lyricfm/
http://dynamic.rte.ie/av/live/radio/lyric.smil
http://www.rte.ie/aertel/184-01.html
http://www.tv3.ie/
http://www.tg4.ie/
http://www.tnag.ie/
http://www.rte.ie/aertel/
http://www.rte.ie/aertel/103-01.html
http://www.irishtimes.com/weather/
http://www.rte.ie/weather/
http://dir.yahoo.com/Regional/Countries/United_Kingdom/News_and_Media/
http://www.thetimes.co.uk/
http://www.the-times.co.uk/
http://www.timesonline.co.uk/
http://en.wikipedia.org/wiki/The_Times
http://www.thesundaytimes.co.uk/
http://www.sunday-times.co.uk/
http://archive.timesonline.co.uk/tol/archive/
http://www.thetimes.co.uk/tto/archive/
http://www.newsint-archive.co.uk/
http://www.newstext.com.au/
http://www.telegraph.co.uk/
http://www.independent.co.uk/
http://www.guardian.co.uk/
http://en.wikipedia.org/wiki/The_Guardian
http://www.observer.co.uk/
http://observer.guardian.co.uk/
http://archive.guardian.co.uk/
http://browse.guardian.co.uk/
http://www.guardian.co.uk/Archive/
http://users.guardian.co.uk/help/search/
http://www.spectator.co.uk/
http://www.private-eye.co.uk/
http://www.newstatesman.co.uk/

I have run the program against several different pages without any problems, but for one particular page I get the following error message:

Exception in thread "main" java.lang.IllegalArgumentException: protocol = http host = null
    at sun.net.spi.DefaultProxySelector.select(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.followRedirect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getHeaderFieldKey(Unknown Source)
    at test2.main(test2.java:77)

The error occurs at this line of the code:

String name = http.getHeaderFieldKey(i);

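Answers to earlier questions on this topic suggest that the problem is that the program reads the URL's host as null. I do not know why that would happen (assuming a null host really is the root of the problem). The URL that triggers the error appears to be well formed and looks no different from the many other URLs the program handles correctly.

This is more or less my first question, so any constructive comments on the problem or on the way I have asked it are welcome.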

import javax.swing.text.html.*;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.BadLocationException;
import java.net.*;
import java.io.*;
import java.util.regex.*;

class test2 
{
    public static void main (String args[]) throws Exception
    {
        String fileOut2 = System.getProperty("user.dir") + File.separator + "page_contents.txt";
        String fileURLOut = System.getProperty("user.dir") + File.separator + "urls.txt";
        String brokenLinks = System.getProperty("user.dir") + File.separator + "broken2.html";


        BufferedReader URLIn = new BufferedReader(new FileReader(fileOut2));
        PrintWriter URLOut = new PrintWriter(new FileWriter(fileURLOut));
        PrintWriter brokenOut = new PrintWriter(new FileWriter(brokenLinks));



        try
        {


            String urlPattern = "((https?|ftp|gopher|telnet):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";



            String x;

            while ((x = URLIn.readLine()) != null)
            {
                System.out.println("Entered while loop!");
                Pattern p =     Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);

                Matcher m = p.matcher(x);

                if (m.find())               
                {
                    URLOut.println(x.substring(m.start(0),m.end(0)));  


                    URL url = new URL(x.substring(m.start(0),m.end(0)));
                    HttpURLConnection http = (HttpURLConnection)url.openConnection();
                    http.setConnectTimeout(5000);
                    for (int i=0; ; i++) 
                    {
                        String name = http.getHeaderFieldKey(i);
                        String value = http.getHeaderField(i);

                        if (name == null && value == null)     // end of headers
                        {
                            break;         
                        }

                        if (name == null)     // first line of headers
                        {
                            if(!value.substring(9, 12).equals("200"))
                            {
                                brokenOut.println("<li><a href=\"" + url + "\">" + url + "</a>" + " " + value.substring(9, 12) + "</li>");
                            }
                        }
                        else
                        {
                            System.out.println(name + "=" + value + "!!!!!!");
                        }
                    }
                }
            }   

        } catch (MalformedURLException e) 
        {
            System.out.println("Malformed URL!!!!!");
        } catch (IOException e) 
        {
            throw new RuntimeException("IO Exception!!!!!", e);
        } finally
        {
            if (URLIn != null)
            {
                URLIn.close();
            }
            if (URLOut != null)
            {
                URLOut.close();
            }
            if (brokenOut != null)
            {
                brokenOut.close();
            }
        }
    }   
}


If you handle the exception inside the loop, you get a chance to recover from a bad URL and carry on with the rest.
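To illustrate handling the exception inside the loop, here is a minimal sketch. The class name PerUrlHandling, the Checker interface, and the checkAll method are hypothetical names made up for this example, and the example.com-style URLs are placeholders. The checker callback stands in for the HttpURLConnection code in the question, so the sketch itself makes no network calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerUrlHandling {

    /** Hypothetical callback standing in for the HTTP check of a single URL. */
    public interface Checker {
        void check(String url) throws Exception;
    }

    /**
     * Runs the checker on every URL. A failure on one URL is recorded
     * and the loop moves on to the next URL instead of aborting.
     */
    public static List<String> checkAll(List<String> urls, Checker checker) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            try {
                checker.check(url);
            } catch (Exception e) {
                // Catches IOException as well as unchecked exceptions such as
                // IllegalArgumentException, which a catch (IOException) would miss.
                failed.add(url + " (" + e.getClass().getSimpleName() + ")");
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a.example/", "http://b.example/");
        // Simulated checker: pretend the second URL triggers the error from the question.
        List<String> failed = checkAll(urls, url -> {
            if (url.contains("b.example")) {
                throw new IllegalArgumentException("protocol = http host = null");
            }
        });
        System.out.println("Failed: " + failed);
    }
}
```

The point of the design is that the try/catch sits inside the for loop, so one bad URL costs one entry in the failure list rather than the whole run. Note that the IllegalArgumentException in the stack trace is unchecked, so the catch (IOException e) in the original code would not have caught it.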

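Thanks, I have edited my question to include an abridged version of the code along with the contents of the input file. With my basic knowledge of exception handling, all I have managed so far is to get the program to crash at the same point with an error message that I defined myself. My research then turned up a link suggesting that the problem is related to redirects to a URL on the same domain: "This is related to redirects, but not to a new domain. It happens when the HTTP Location header specifies a relative URL instead of an absolute URL."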
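If the quoted explanation applies here, one possible workaround is to switch off automatic redirect following and resolve the Location header against the URL that was just requested. The following is only a sketch under that assumption. The class name RedirectResolver and the methods resolveLocation and finalStatus are hypothetical names for this example, and it assumes every redirect target is an http or https URL that java.net.URI can parse.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

public class RedirectResolver {

    /**
     * Resolves a Location header value against the URL that was requested.
     * An absolute Location is returned unchanged; a relative one is
     * completed with the scheme and host of the request URL.
     */
    public static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    /**
     * Follows redirects by hand, up to maxHops, and returns the status code
     * of the first response that is not a redirect with a Location header.
     */
    public static int finalStatus(String startUrl, int maxHops) throws IOException {
        String current = startUrl;
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection http = (HttpURLConnection) new URL(current).openConnection();
            http.setInstanceFollowRedirects(false); // do not let the JDK follow redirects itself
            http.setConnectTimeout(5000);
            http.setReadTimeout(5000);
            int status = http.getResponseCode();
            String location = http.getHeaderField("Location");
            http.disconnect();

            boolean redirect = status >= 300 && status < 400 && location != null;
            if (!redirect) {
                return status;
            }
            current = resolveLocation(current, location);
        }
        throw new IOException("Too many redirects starting from " + startUrl);
    }
}
```

In the link checker, the value returned by finalStatus(url, 5) could be compared against 200 in place of parsing characters 9 to 12 of the status line. A Location value containing characters that are illegal in a URI would make URI.create throw an IllegalArgumentException, so the call is best wrapped in a per-URL try/catch.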