Java StringIndexOutOfBoundsException in a web crawler for anchor tags


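I have a web crawler that takes a URL, crawls the page's HTML for anchor tags, and converts each link to an absolute path before storing it. I call .getHost() on the original URL and pass the result to the method getAbsolutePath(), which checks whether the path contains http or https (meaning it is already absolute) or starts with "/" (meaning it is root-relative). In either case I concatenate the path with whatever is missing, so I end up with something like "http://" + hostname + '/' + hrefPath, which can return something like http://www.oracle.com/hrefPath.

The problem is that when I index into the string to get the substring or the absolute path name, I get a java.lang.StringIndexOutOfBoundsException, and I don't understand how that happens.

I'm testing it with this: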

public static void main(String[] args) throws MalformedURLException, IOException, FileNotFoundException {

   CrawlerEng myCrawlerEngine = new CrawlerEng();
   myCrawlerEngine.open("http://www.oracle.com/us/corporate/features/business-by-design");
   myCrawlerEngine.follow(100);
   myCrawlerEngine.save2Html("output.html");
   }
}

And this reads the data:

//Make a connection with the specified url
   public void open(String url) throws MalformedURLException, IOException {
      address = new URL(url);
      conn = address.openConnection();
   }  

   //Loads the content of the page, finds links and stores them in the ArrayList
   public void follow() throws IOException {
      InputStream stream = conn.getInputStream();
      Scanner in = new Scanner(stream);

      in.useDelimiter("<a");
      while(in.hasNext()) {

         UrlPath current = new UrlPath(in.next());
         String absPath = current.getAbsolutePath(address.getHost());
         Link link = new Link(absPath);
         list.add(link);
      } 
      in.close();   
   }
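As a side note on the loop above, here is a minimal sketch of what a Scanner delimited by "<a" hands to the UrlPath constructor. The class name ScannerTokens and the sample HTML are made up for illustration. The first token is everything before the first "<a", so it normally has no href at all, and a tag such as `<a href="">` yields a token whose quoted value is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the original crawler.
public class ScannerTokens {

    // Splits the given HTML the same way follow() does and returns the tokens.
    static List<String> tokens(String html) {
        List<String> result = new ArrayList<>();
        try (Scanner in = new Scanner(html)) {
            in.useDelimiter("<a"); // same delimiter as in follow()
            while (in.hasNext()) {
                result.add(in.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample page: one normal link and one link with an empty href.
        String html = "<html><body><a href=\"/x\">X</a><a href=\"\">Y</a></body></html>";
        for (String token : tokens(html)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because the delimiter is the plain text "<a", it would also split on tags such as `<abbr>` or `<article>`, which is another way a token without a usable href can reach the constructor.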

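The exception is reported for index 0, at the charAt(0) call in the getAbsolutePath method, which leads me to believe that the string is empty for some reason. Otherwise, if the string had never been initialized, wouldn't it throw a NullPointerException instead?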

To clarify further, the question is not what a StringIndexOutOfBoundsException is, but why it happens in this case. Thanks.
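The reasoning about empty versus uninitialized strings can be checked with a small sketch. The class name EmptyVsNull and the helper probe are hypothetical; the sketch only shows which exception charAt(0) raises for a non-empty string, an empty string, and a null reference.

```java
// Hypothetical demo class, not part of the original crawler.
public class EmptyVsNull {

    // Returns the name of the exception thrown by s.charAt(0), or "none".
    static String probe(String s) {
        try {
            s.charAt(0);
            return "none";
        } catch (StringIndexOutOfBoundsException e) {
            return "StringIndexOutOfBoundsException";
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(probe("/us")); // non-empty string
        System.out.println(probe(""));    // empty string, e.g. from href=""
        System.out.println(probe(null));  // reference that was never assigned
    }
}
```

An empty string is a valid object of length 0, so index 0 is already out of range, whereas a null reference fails before any index is checked. That is consistent with the guess that the path here is empty rather than uninitialized.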

Your code seems to run without any errors, at least the simplified version I just tested in IntelliJ. You may want to include a stack trace here.

Your comment got me thinking, and I realized that the stripped-down version was not iterating over as many links as the full version. I reduced the "limit" passed as a parameter to the follow method, and it worked for me. I just need to add some conditions so that it works with a higher number of links, depending on the content of the href.

Glad to help (did I really help you?).

Haha, yes you did, you put my thinking on the right track.

public class UrlPath {

   private String path;

   public UrlPath(String path) {

      int hrefIndex = path.indexOf("href=");
      int start = path.indexOf("\"", hrefIndex);
      int end = path.indexOf("\"", start + 1);
      this.path = path.substring(start + 1, end);
   }

   public String getAbsolutePath(String host) {

      if(path.contains("http") || path.contains("https")) 
         return path;
      else if(path.charAt(0) == '/')
         return  "http://" + host + path;
      else
         return "http://" + host + '/' + path;  
   }     
}
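The "conditions" mentioned in the comments might look something like the following sketch. It is not the asker's final code: the class SafeUrlPath and the methods extractHref and toAbsolute are hypothetical names, and the sketch assumes double-quoted href values, as the original class does. It skips tokens with no href, no closing quote, or an empty value instead of letting substring or charAt throw.

```java
import java.util.Optional;

// Hypothetical guarded variant of UrlPath, for illustration only.
public class SafeUrlPath {

    // Returns the quoted href value of one scanner token, or empty if the
    // token has no href, no closing quote, or an empty value.
    static Optional<String> extractHref(String token) {
        int hrefIndex = token.indexOf("href=");
        if (hrefIndex < 0) return Optional.empty();
        int start = token.indexOf('"', hrefIndex);
        if (start < 0) return Optional.empty();
        int end = token.indexOf('"', start + 1);
        if (end < 0) return Optional.empty();
        String value = token.substring(start + 1, end);
        return value.isEmpty() ? Optional.empty() : Optional.of(value);
    }

    // Same three cases as getAbsolutePath, but using startsWith so that a
    // relative path that merely contains "http" is not treated as absolute.
    static String toAbsolute(String path, String host) {
        if (path.startsWith("http://") || path.startsWith("https://")) return path;
        if (path.startsWith("/")) return "http://" + host + path;
        return "http://" + host + "/" + path;
    }

    public static void main(String[] args) {
        String host = "www.oracle.com";
        String[] tokens = {
            "<html><head>",                   // text before the first anchor
            " href=\"/us/index.html\">Home</a>",
            " href=\"\">Empty</a>",           // the case that breaks charAt(0)
            " name=\"top\">Anchor</a>"        // anchor without href
        };
        for (String token : tokens) {
            extractHref(token)
                .map(path -> toAbsolute(path, host))
                .ifPresent(System.out::println);
        }
    }
}
```

In follow(), the loop would then add a Link only when extractHref returns a value. Switching from contains to startsWith is a deliberate change from the original check, not something the question itself asks for.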