Java: Parsing HTML into objects
I am trying to use jsoup to parse the following HTML into objects in Java. I want to iterate over the elements and extract every "class" as an object in order to build timetable data. Each "class" has a time, a location, a lecturer, a description and so on, but that is not the problem.
All of the elements have the class tt_details. There is no specific parent-child relationship for each day, but I can extract the days involved with Elements dayNames = content.getElementsByClass("tt_day").
Each day can have a different number of "classes": as you can see, Monday has 3 "classes" while Tuesday has a different number, so a normal loop structure does not work. How can I achieve this?
<div class='tt_details'>
<div class='tt_day'>Mon</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>11:00 - 13:00
<div class='tt_day_small'> (Mon)</div>
</div>
<div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>
<div class='tt_lecturer'>Loftus, M</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>13:00 - 14:00
<div class='tt_day_small'> (Mon)</div>
</div>
<div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
<div class='tt_lecturer'>Lang, D</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>16:00 - 18:00
<div class='tt_day_small'> (Mon)</div>
</div>
<div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>
<div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
<div class='tt_day'>Tue</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>09:00 - 10:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>
<div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>10:00 - 11:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>
<div class='tt_lecturer'>O'Regan,D</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>11:00 - 12:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
<div class='tt_lecturer'>Kinsella,V</div>
</div>
<div class='tt_details'>
<div class='tt_timeslot'>16:00 - 17:00
<div class='tt_day_small'> (Tue)</div>
</div>
<div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>
<div class='tt_lecturer'>Lang, D</div>
</div>
If this HTML is the source of an online page, you could use Selenium for this; you would need to add the Selenium JARs to your project.
My suggestion:
String datentime = driver.findElement(By.className("tt_timeslot")).getText();
If several elements share the same name, use a unique id, a CSS selector, or an XPath instead.
Try this:
static final String[] DETAILS = { "tt_timeslot", "tt_day_small", "tt_detail", "tt_lecturer" };
and
Result:
*** Mon ***
--------
tt_timeslot : 11:00 - 13:00 (Mon)
tt_day_small : (Mon)
tt_detail : Internet of Things E1010 - MAC Lab
tt_lecturer : Loftus, M
--------
tt_timeslot : 13:00 - 14:00 (Mon)
tt_day_small : (Mon)
tt_detail : Computer Systems & Networking A0004 - Tiered Lecture Theatre (132)
tt_lecturer : Lang, D
--------
tt_timeslot : 16:00 - 18:00 (Mon)
tt_day_small : (Mon)
tt_detail : Intro.to Programming L8 D2005 - Computer Laboratory (32)
tt_lecturer : Kinsella,V
*** Tue ***
--------
tt_timeslot : 09:00 - 10:00 (Tue)
tt_day_small : (Tue)
tt_detail : Mathematics 2 A0004 - Tiered Lecture Theatre (132)
tt_lecturer : O'Regan,D
--------
tt_timeslot : 10:00 - 11:00 (Tue)
tt_day_small : (Tue)
tt_detail : Mathematics 2 E0017 - Tiered Classroom (106)
tt_lecturer : O'Regan,D
--------
tt_timeslot : 11:00 - 12:00 (Tue)
tt_day_small : (Tue)
tt_detail : Intro to Programming A0006 - Tiered Lecture Theatre (152)
tt_lecturer : Kinsella,V
--------
tt_timeslot : 16:00 - 17:00 (Tue)
tt_day_small : (Tue)
tt_detail : Computer Systems & Networking A0006 - Tiered Lecture Theatre (152)
tt_lecturer : Lang, D
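The output above can be produced by a loop along these lines. This is a minimal sketch, not the answerer's code: it assumes jsoup (1.11.1 or later, for selectFirst) is on the classpath, and the class name TimetableDump and the method dump are hypothetical names chosen for this illustration. It walks the tt_details blocks in document order, treats a block that contains tt_day as a day header, and prints the DETAILS fields of every other block.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TimetableDump {
    static final String[] DETAILS = { "tt_timeslot", "tt_day_small", "tt_detail", "tt_lecturer" };

    // Hypothetical helper: renders the timetable as text, one line per field.
    static String dump(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        // select() returns the blocks in document order, so a day header
        // always comes before the classes that belong to it.
        for (Element block : doc.select("div.tt_details")) {
            Element day = block.selectFirst("div.tt_day");
            if (day != null) {
                out.append("*** ").append(day.text()).append(" ***\n");
                continue;
            }
            out.append("--------\n");
            for (String name : DETAILS) {
                Element field = block.selectFirst("div." + name);
                if (field != null) {
                    out.append(name).append(" : ").append(field.text()).append('\n');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div class='tt_details'><div class='tt_day'>Mon</div></div>"
            + "<div class='tt_details'>"
            + "<div class='tt_timeslot'>11:00 - 13:00<div class='tt_day_small'> (Mon)</div></div>"
            + "<div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>"
            + "<div class='tt_lecturer'>Loftus, M</div>"
            + "</div>";
        System.out.print(dump(html));
    }
}
```

Because the day header and its classes are siblings rather than parent and child, relying on document order is what ties each class to its day here.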
Something like this might help:
String html = ""
+"<div class='tt_details'>"
+" <div class='tt_day'>Mon</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>11:00 - 13:00"
+" <div class='tt_day_small'> (Mon)</div>"
+" </div>"
+" <div class='tt_detail'>Internet of Things<br/>E1010 - MAC Lab <br/></div>"
+" <div class='tt_lecturer'>Loftus, M</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>13:00 - 14:00"
+" <div class='tt_day_small'> (Mon)</div>"
+" </div>"
+" <div class='tt_detail'>Computer Systems & Networking<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
+" <div class='tt_lecturer'>Lang, D</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>16:00 - 18:00"
+" <div class='tt_day_small'> (Mon)</div>"
+" </div>"
+" <div class='tt_detail'>Intro.to Programming L8<br/>D2005 - Computer Laboratory (32) <br/></div>"
+" <div class='tt_lecturer'>Kinsella,V</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_day'>Tue</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>09:00 - 10:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Mathematics 2<br/>A0004 - Tiered Lecture Theatre (132) <br/></div>"
+" <div class='tt_lecturer'>O'Regan,D</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>10:00 - 11:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Mathematics 2<br/>E0017 - Tiered Classroom (106) <br/></div>"
+" <div class='tt_lecturer'>O'Regan,D</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>11:00 - 12:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Intro to Programming<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
+" <div class='tt_lecturer'>Kinsella,V</div>"
+"</div>"
+"<div class='tt_details'>"
+" <div class='tt_timeslot'>16:00 - 17:00"
+" <div class='tt_day_small'> (Tue)</div>"
+" </div>"
+" <div class='tt_detail'>Computer Systems & Networking<br/>A0006 - Tiered Lecture Theatre (152) <br/></div>"
+" <div class='tt_lecturer'>Lang, D</div>"
+"</div>"
;
// Imports needed: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.select.Elements, java.util.ArrayList, java.util.HashMap, java.util.List, java.util.Map
Document doc = Jsoup.parse(html);
// Select only the course blocks, skipping the blocks that hold a day header
Elements courseEls = doc.select("div.tt_details:not(:has(div.tt_day))");

// Simple holder for one timetable entry
class Course {
    public Course(String day, String time, String lecturer, String subject) {
        super();
        this.day = day;
        this.time = time;
        this.lecturer = lecturer;
        this.subject = subject;
    }
    public String day;
    public String time;
    public String lecturer;
    public String subject;
    public String toString() {
        return day + " : " + time + " : " + lecturer + " : " + subject;
    }
}

Map<String, List<Course>> coursesByDay = new HashMap<>();
for (Element courseEl : courseEls) {
    Element timeSlotEl = courseEl.select(".tt_timeslot").first();
    // ownText() leaves out the nested tt_day_small text, so only the time remains
    String timeSlotStr = timeSlotEl.ownText();
    // "(Mon)" -> "Mon"
    String dayStr = timeSlotEl.select(".tt_day_small").first().text().trim().replace("(", "").replace(")", "");
    String detailStr = courseEl.select(".tt_detail").first().text();
    String lecturerStr = courseEl.select(".tt_lecturer").first().text();
    Course course = new Course(dayStr, timeSlotStr, lecturerStr, detailStr);
    // Create the list for this day the first time the day is seen
    List<Course> courses = coursesByDay.get(dayStr);
    if (courses == null) {
        courses = new ArrayList<>();
        coursesByDay.put(dayStr, courses);
    }
    courses.add(course);
}

// get all courses on Tue
List<Course> courses = coursesByDay.get("Tue");
for (Course c : courses) {
    System.out.println(c);
}
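This creates a map of the courses for each day: the map key is the day, and each value is a list of Course objects.
A few remarks on this:
- I use a custom object to hold the course information.
- I use the selector
div.tt_details:not(:has(div.tt_day))
to get only the course divs and not the day divs. This is possible because the day information is repeated inside each course div.
- CSS selectors are used to get the details.
- Note the difference between ownText() and text(). ownText() is used here to get only the time information, without the day.
- The map is filled dynamically with its contents.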
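The step that fills the map can also be written with Map.computeIfAbsent, which replaces the explicit null check. The sketch below is plain Java with no jsoup dependency; the class name CourseGrouping, the trimmed-down Course class, and the sample values are illustrative only, and a LinkedHashMap is swapped in for the HashMap so that the days keep their first-seen order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CourseGrouping {
    // Illustrative stand-in for the Course class used in the answer.
    static final class Course {
        final String day;
        final String time;
        Course(String day, String time) {
            this.day = day;
            this.time = time;
        }
        @Override
        public String toString() {
            return day + " : " + time;
        }
    }

    // Groups courses by day; LinkedHashMap keeps the days in first-seen order.
    static Map<String, List<Course>> groupByDay(List<Course> courses) {
        Map<String, List<Course>> byDay = new LinkedHashMap<>();
        for (Course c : courses) {
            // Creates the list the first time a day is seen, then appends to it.
            byDay.computeIfAbsent(c.day, d -> new ArrayList<>()).add(c);
        }
        return byDay;
    }

    public static void main(String[] args) {
        List<Course> all = Arrays.asList(
            new Course("Mon", "11:00 - 13:00"),
            new Course("Mon", "13:00 - 14:00"),
            new Course("Tue", "09:00 - 10:00"));
        Map<String, List<Course>> byDay = groupByDay(all);
        // getOrDefault avoids a NullPointerException for a day with no courses.
        for (Course c : byDay.getOrDefault("Tue", Collections.<Course>emptyList())) {
            System.out.println(c);
        }
    }
}
```

Any number of courses per day is handled the same way, since each list simply grows as entries for that day arrive.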