Java: Parsing HTML into objects (Java, Jsoup)

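I am trying to parse the following HTML into objects in Java using jsoup.

I want to iterate over the elements and extract every "class" (a timetabled session) as an object in order to build timetable data. Each class has a time, a location, a lecturer, a description and so on, but that is not the problem. All of the elements belong to the CSS class tt_details. There is no parent-child relationship that ties a class to its day, but I can extract the days involved with Elements dayNames = content.getElementsByClass("tt_day").

Each day can have a different number of classes: as you can see, Monday has 3 and Tuesday has 4, so a normal loop structure does not work. How can I achieve this?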

<div class='tt_details'>
    <div class='tt_day'>Mon</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>11:00 - 13:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>
    <div class='tt_lecturer'>Loftus, M</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>13:00 - 14:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
    <div class='tt_lecturer'>Lang, D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>16:00 - 18:00
        <div class='tt_day_small'> (Mon)</div>
    </div>
    <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>
    <div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
    <div class='tt_day'>Tue</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>09:00 - 10:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
    <div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>10:00 - 11:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>
    <div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>11:00 - 12:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
    <div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
    <div class='tt_timeslot'>16:00 - 17:00
        <div class='tt_day_small'> (Tue)</div>
    </div>
    <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
    <div class='tt_lecturer'>Lang, D</div>
</div>

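If this is the HTML source of a live web page, you can use Selenium for this. To do so you need to add the Selenium JARs to your project.

My suggestion:

String datentime = driver.findElement(By.className("tt_timeslot")).getText();

If several elements share the same class name, locate the one you want by a unique id, a CSS selector, or an XPath expression instead.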

Try this:

static final String[] DETAILS = { "tt_timeslot", "tt_day_small", "tt_detail", "tt_lecturer" };

Result:

*** Mon ***
    --------
         tt_timeslot : 11:00 - 13:00 (Mon)
        tt_day_small : (Mon)
           tt_detail : Internet of Things E1010 - MAC Lab
         tt_lecturer : Loftus, M
    --------
         tt_timeslot : 13:00 - 14:00 (Mon)
        tt_day_small : (Mon)
           tt_detail : Computer Systems & Networking A0004 - Tiered Lecture Theatre (132)
         tt_lecturer : Lang, D
    --------
         tt_timeslot : 16:00 - 18:00 (Mon)
        tt_day_small : (Mon)
           tt_detail : Intro.to Programming L8 D2005 - Computer Laboratory (32)
         tt_lecturer : Kinsella,V
*** Tue ***
    --------
         tt_timeslot : 09:00 - 10:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Mathematics 2 A0004 - Tiered Lecture Theatre (132)
         tt_lecturer : O'Regan,D
    --------
         tt_timeslot : 10:00 - 11:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Mathematics 2 E0017 - Tiered Classroom (106)
         tt_lecturer : O'Regan,D
    --------
         tt_timeslot : 11:00 - 12:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Intro to Programming A0006 - Tiered Lecture Theatre (152)
         tt_lecturer : Kinsella,V
    --------
         tt_timeslot : 16:00 - 17:00 (Tue)
        tt_day_small : (Tue)
           tt_detail : Computer Systems & Networking A0006 - Tiered Lecture Theatre (152)
         tt_lecturer : Lang, D
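
Only the DETAILS constant of this answer survives here, so the loop that printed the output above is not shown. One way such grouping could work is a single pass over the flat blocks that remembers the most recent day header and attaches every following class block to it. The sketch below illustrates only that idea and is not the answer's original code: Row, DaySchedule and groupByDay are hypothetical names, and Row stands in for a tt_details element so that the example runs without jsoup. With jsoup, the rows would instead come from something like content.getElementsByClass("tt_details").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlatTimetableGrouping {

    /** Hypothetical stand-in for one tt_details block. */
    static final class Row {
        final boolean dayHeader; // true for a tt_day row, false for a class row
        final String text;       // day name, or a one-line class summary

        Row(boolean dayHeader, String text) {
            this.dayHeader = dayHeader;
            this.text = text;
        }
    }

    /** One day plus the classes listed under it. */
    static final class DaySchedule {
        final String day;
        final List<String> classes = new ArrayList<>();

        DaySchedule(String day) {
            this.day = day;
        }
    }

    /** Single pass: each class row is attached to the most recent day header. */
    static List<DaySchedule> groupByDay(List<Row> rows) {
        List<DaySchedule> days = new ArrayList<>();
        DaySchedule current = null;
        for (Row row : rows) {
            if (row.dayHeader) {
                // A header starts a new group, however many classes follow it.
                current = new DaySchedule(row.text);
                days.add(current);
            } else if (current != null) {
                current.classes.add(row.text);
            }
            // Class rows that appear before any day header are skipped.
        }
        return days;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row(true, "Mon"),
                new Row(false, "11:00 - 13:00 Internet of Things"),
                new Row(false, "13:00 - 14:00 Computer Systems & Networking"),
                new Row(true, "Tue"),
                new Row(false, "09:00 - 10:00 Mathematics 2"));
        for (DaySchedule d : groupByDay(rows)) {
            System.out.println("*** " + d.day + " ***");
            for (String c : d.classes) {
                System.out.println("    " + c);
            }
        }
    }
}
```

Because the group boundary is the header row itself, the loop never needs to know in advance how many classes a day has.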

Something like this might help:

String html = ""
        +"<div class='tt_details'>"
        +"    <div class='tt_day'>Mon</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>11:00 - 13:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>"
        +"    <div class='tt_lecturer'>Loftus, M</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>13:00 - 14:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
        +"    <div class='tt_lecturer'>Lang, D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>16:00 - 18:00"
        +"        <div class='tt_day_small'> (Mon)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>"
        +"    <div class='tt_lecturer'>Kinsella,V</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_day'>Tue</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>09:00 - 10:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
        +"    <div class='tt_lecturer'>O'Regan,D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>10:00 - 11:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>"
        +"    <div class='tt_lecturer'>O'Regan,D</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>11:00 - 12:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
        +"    <div class='tt_lecturer'>Kinsella,V</div>"
        +"</div>"
        +"<div class='tt_details'>"
        +"    <div class='tt_timeslot'>16:00 - 17:00"
        +"        <div class='tt_day_small'> (Tue)</div>"
        +"    </div>"
        +"    <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
        +"    <div class='tt_lecturer'>Lang, D</div>"
        +"</div>"
        ;
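// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.select.Elements and java.util.* on the import list.
Document doc = Jsoup.parse(html);
// Select only the course blocks: tt_details divs that do not contain a tt_day header.
Elements courseEls = doc.select("div.tt_details:not(:has(div.tt_day))");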
class Course{
    public Course(String day, String time, String lecturer, String subject) {
        super();
        this.day = day;
        this.time = time;
        this.lecturer = lecturer;
        this.subject = subject;
    }
    public String day;
    public String time;
    public String lecturer;
    public String subject;

    public String toString(){
        return day + " : "+ time +" : "+ lecturer + " : "+ subject;
    }
}
Map<String,List<Course>> coursesByDay = new HashMap<>();
for (Element courseEl : courseEls){
    Element timeSlotEl = courseEl.select(".tt_timeslot").first();
    String timeSlotStr = timeSlotEl.ownText();
    String dayStr = timeSlotEl.select(".tt_day_small").first().text().trim().replace("(", "").replace(")", "");
    String detailStr = courseEl.select(".tt_detail").first().text();
    String lecturerStr = courseEl.select(".tt_lecturer").first().text();

    Course course = new Course(dayStr, timeSlotStr, lecturerStr, detailStr);
    List<Course> courses = coursesByDay.get(dayStr);
    if (courses == null){
        courses = new ArrayList<>();
        coursesByDay.put(dayStr, courses);
    }
    courses.add(course);
}

//get all courses on Tue
List<Course> courses = coursesByDay.get("Tue");
for (Course c : courses){
    System.out.println(c);
}
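This creates a map of the courses for each day: the map key is the day, and each value is a list of Course objects.

A few remarks on this:

  • I use a custom object to hold the course information.
  • I use the selector
    div.tt_details:not(:has(div.tt_day))
    to get only the course divs and not the day divs. This is possible because the day information is repeated inside each course div.
  • CSS selectors are used to pull out the individual details.
  • Note the difference between ownText() and text(). ownText() is used on the timeslot element so that only the time is returned, without the day held in the nested tt_day_small div.
  • The map is filled dynamically as the course elements are processed.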
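
The get / null check / put sequence that fills the map can also be written with Map.computeIfAbsent (Java 8 and later). The sketch below additionally swaps HashMap for LinkedHashMap, because HashMap does not guarantee iteration order while LinkedHashMap returns the days in the order they were first seen. This is a variation on the answer, not part of it: CoursesByDayIndex, add, on and days are hypothetical names, and courses are stored as plain strings for brevity rather than as Course objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CoursesByDayIndex {

    // Insertion-ordered, so days come back in the order they were first added.
    private final Map<String, List<String>> byDay = new LinkedHashMap<>();

    /** Adds a course, creating the day's list on first use. */
    void add(String day, String course) {
        byDay.computeIfAbsent(day, k -> new ArrayList<>()).add(course);
    }

    /** Courses for a day, or an empty list if the day is unknown. */
    List<String> on(String day) {
        return byDay.getOrDefault(day, Collections.emptyList());
    }

    /** Days in first-seen order. */
    List<String> days() {
        return new ArrayList<>(byDay.keySet());
    }

    public static void main(String[] args) {
        CoursesByDayIndex index = new CoursesByDayIndex();
        index.add("Mon", "11:00 - 13:00 Internet of Things");
        index.add("Tue", "09:00 - 10:00 Mathematics 2");
        index.add("Tue", "10:00 - 11:00 Mathematics 2");
        for (String day : index.days()) {
            System.out.println(day + " : " + index.on(day));
        }
    }
}
```

Returning an empty list for an unknown day also avoids the NullPointerException that a for-each loop over the result of get() would throw when the requested day has no courses.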