Java: Extracting HTML data with Jsoup

Tags: java, html, jsoup, informatica

I have a table with columns such as ID and TEXT. TEXT is a CLOB column that contains data in HTML format.

Sample data:

<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am<?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm<o:p></o:p></SPAN></P>
<P class=00Normal style="MARGIN: 0in 0in 0pt 24.3pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<SPAN style="mso-spacerun: yes">  </SPAN>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<SPAN style="mso-spacerun: yes">  </SPAN>The following items represent the scope and visit focus areas:<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<SPAN style="FONT: 7pt 'Times New Roman'">       </SPAN></SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program<o:p></o:p></SPAN></P>
I know very little about Java. Can I get Java code that uses jsoup to extract the data and return the output below?

<html>
 <head></head>
 <body>
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Start: 8:30 am
    <!--?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /-->
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">End: 4 pm
    <o:p></o:p></span></p> 
  <p class="00Normal" style="MARGIN: 0in 0in 0pt 24.3pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals.<span style="mso-spacerun: yes"> </span>A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below.<span style="mso-spacerun: yes"> </span>The following items represent the scope and visit focus areas:
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">1.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">SOP Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">2.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Training Program
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.9in; TEXT-INDENT: -22.5pt"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold"> 
    <o:p></o:p></span></p> 
  <p class="MsoNormal" style="MARGIN: 0in 0in 0pt 60.3pt; TEXT-INDENT: -0.25in; tab-stops: list 60.3pt; mso-list: l120 level1 lfo139"><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">3.<span style="FONT: 7pt 'Times New Roman'"> </span></span><span style="FONT-SIZE: 10pt; FONT-FAMILY: Arial; mso-bidi-font-weight: bold">Calibration/Preventive Maintenance Program
    <o:p></o:p></span></p> 
 </body>
</html>
Start: 8:30 am End: 4 pm The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas: 1. SOP Program 2. Training Program 3. Calibration/Preventive Maintenance Program
Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program

Actually, this is only sample data. I have not included the real data with its HTML tags here.

Since the information is divided among <p> tags, you have to select all of those tags and print their text one by one, assuming AUDIT_SCOPE_LOB is a valid Java String.

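import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// Parse the HTML string, then print the text of each <p> on its own line.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);
Elements el = doc.select("p");
for (Element e : el) {
    System.out.println(e.text());
}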
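The snippet above assumes AUDIT_SCOPE_LOB is already a String, while the question says the TEXT column is a CLOB. If the value arrives from JDBC as a java.sql.Clob, one possible way to turn it into a String is sketched below. The class name ClobText and the helper clobToString are hypothetical names chosen for this illustration, and the sketch assumes the CLOB is small enough to hold in memory; for very large values, reading from getCharacterStream() would likely be a better fit.

```java
import java.sql.Clob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialClob;

// Hypothetical helper, not part of jsoup or of the original answer.
class ClobText {

    // Returns the full CLOB content as a String, or "" for a null or empty CLOB.
    static String clobToString(Clob clob) throws SQLException {
        if (clob == null) {
            return "";
        }
        long length = clob.length();
        if (length == 0) {
            return "";
        }
        if (length > Integer.MAX_VALUE) {
            throw new SQLException("CLOB too large for a single String");
        }
        // Clob positions are 1-based, and getSubString takes an int length.
        return clob.getSubString(1, (int) length);
    }

    public static void main(String[] args) throws SQLException {
        // SerialClob is an in-memory Clob from the JDK, used here only for the demo.
        Clob sample = new SerialClob("<P>Start: 8:30 am</P>".toCharArray());
        System.out.println(clobToString(sample));
    }
}
```

The resulting String can then be passed to Jsoup.parse in place of AUDIT_SCOPE_LOB.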

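org.jsoup.nodes.Element.toString() returns:

Gets the outer HTML of this node.

org.jsoup.nodes.Element.text() returns:

Gets the combined text of this element and all its children. Whitespace is normalized and trimmed.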


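So calling toString() on the whole sample returns the HTML output shown in your question. Likewise, calling text() returns all of the text, without tags, as a single string. What you need, however, is a separate string for each paragraph of text.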
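To make the difference concrete, here is a small sketch that runs all three calls against a shortened stand-in for the sample data. The class name TextVersusOuterHtml, the helper allTextAsOneString, and the two-paragraph HTML constant are made up for this illustration; only Jsoup.parse, select, toString() and text() come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative class; the name and the shortened HTML are assumptions of this sketch.
class TextVersusOuterHtml {

    static final String HTML =
            "<P class=00Normal><SPAN>Start: 8:30 am</SPAN></P>"
          + "<P class=00Normal><SPAN>End: 4 pm</SPAN></P>";

    // All text of the document as one string, with the tags removed.
    static String allTextAsOneString(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // Whole document as HTML, tags included.
        System.out.println(doc.toString());
        // Every paragraph joined into a single string.
        System.out.println(allTextAsOneString(HTML));
        // One string per paragraph.
        for (Element p : doc.select("p")) {
            System.out.println(p.text());
        }
    }
}
```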


Some of your paragraph tags are empty. To get the output in your example, you should first check that each paragraph actually has text.

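// AUDIT_SCOPE_LOB is a String here, so the one-argument parse is enough;
// the second argument of Jsoup.parse(String, String) is a base URI, not a charset.
Document doc = Jsoup.parse(AUDIT_SCOPE_LOB);

for (Element p : doc.select("p")) {
    if (p.hasText()) {
        System.out.println(p.text());
    }
}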
Output

Start: 8:30 am
End: 4 pm
The goal of this visit is to conduct an intial assessment of systems and processes in place to support supply of materials for GMP clinical development and manufacturing of pharmaceuticals. A cGMP assessment of specific processes, procedures, ad the general quality systems will occur as outlined below. The following items represent the scope and visit focus areas:
1. SOP Program
2. Training Program
3. Calibration/Preventive Maintenance Program
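If the lines need to be handed back to a caller instead of printed, the same hasText() check can feed a list. The following is only a sketch under that assumption; the class name ParagraphLines and the method nonEmptyParagraphs are hypothetical, and the HTML in main is a shortened stand-in for the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class ParagraphLines {

    // Returns the text of every <p> that contains non-blank text, in document order.
    static List<String> nonEmptyParagraphs(String html) {
        List<String> lines = new ArrayList<>();
        for (Element p : Jsoup.parse(html).select("p")) {
            if (p.hasText()) {
                lines.add(p.text());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<P><SPAN>Start: 8:30 am</SPAN></P>"
                    + "<P><SPAN> </SPAN></P>"
                    + "<P><SPAN>End: 4 pm</SPAN></P>";
        // Join with newlines to get one paragraph per line.
        System.out.println(String.join("\n", nonEmptyParagraphs(html)));
    }
}
```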

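See the jsoup documentation for more examples of how to parse your data. For example, if you want to parse out the ordered list, you can select on the class name and retrieve the second span of each list item.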

for (Element span : doc.select("p.MsoNormal > span:nth-child(2)")) 
     System.out.println(span.ownText());
Output

SOP Program
Training Program
Calibration/Preventive Maintenance Program
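The selector skips the empty spacer paragraphs because they contain only one span, so nothing in them is a second child. A self-contained sketch of this, using a shortened stand-in for the sample data (the class name ListItemTitles and the method titles are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical class for illustration; only the selector comes from the answer.
class ListItemTitles {

    // Collects the own text of the second span in each p.MsoNormal.
    static List<String> titles(String html) {
        List<String> result = new ArrayList<>();
        for (Element span : Jsoup.parse(html).select("p.MsoNormal > span:nth-child(2)")) {
            result.add(span.ownText());
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
              "<P class=MsoNormal><SPAN>1.<SPAN> </SPAN></SPAN><SPAN>SOP Program</SPAN></P>"
            + "<P class=MsoNormal><SPAN> </SPAN></P>"
            + "<P class=MsoNormal><SPAN>2.<SPAN> </SPAN></SPAN><SPAN>Training Program</SPAN></P>";
        for (String title : titles(html)) {
            System.out.println(title);
        }
    }
}
```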

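"Can I get Java code that uses jsoup to extract the data and return the output below": Yes, but the first source of that code should be you and your own attempt. This isn't a "please give me the code" kind of site, but a question-and-answer site. You need to go through the JSoup tutorials, learn Java and JSoup, and at least attempt a solution before asking, then show your attempt with your question. This will allow you to ask a more specific and answerable question. Luck. BTW, when I say show your attempt, I mean a real attempt, one that shows you've studied the tutorials, something more fleshed out than the lines of code you posted above.

@HovercraftFullOfEels - I think the OP did show a real attempt at solving his problem. If AUDIT_SCOPE_LOB is a String, then applying the two attempts does give the output he provided, which is why I answered him.

@HovercraftFullOfEels Thank you for your time. I am actually an Informatica resource. I came across the name Jsoup while looking for a solution to this HTML problem. I then imported the Jsoup jar file and tried a few things. I don't even have an environment like Eclipse to write and test Java code.

Like the "p" tag, I have to handle all the tags, r8?

I don't know what an r8 tag is, but if it is structured like the p tag, then yes.

As with "p" in the question, do I have to handle all the tags?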