在java中将HTMl转换为文本后读取字符串文本内容

在java中将HTMl转换为文本后读取字符串文本内容,java,string,jsoup,html-email,Java,String,Jsoup,Html Email,我在电子邮件正文中有HTMl表单,如何在将HTMl表单转换为文本后读取字符串文本内容。有人能帮我吗 电子邮件正文-HTML表单: 电子邮件正文-HTML表单内容: <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> <meta name="Generator" content="Microsoft Word 15 (filtered medi

我在电子邮件正文中有HTMl表单,如何在将HTMl表单转换为文本后读取字符串文本内容。有人能帮我吗

电子邮件正文-HTML表单:

电子邮件正文-HTML表单内容:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
        {font-family:Helvetica;
        panose-1:2 11 6 4 2 2 2 2 2 4;}
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:"Century Gothic";
        panose-1:2 11 5 2 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}
span.EmailStyle17
        {mso-style-type:personal-compose;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-family:"Calibri",sans-serif;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal"><img width="809" height="364" style="width:8.427in;height:3.7916in" id="Picture_x0020_4" src="cid:image001.jpg@01D609B1.5BB77760"><o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Non-Contacted/Non-Qualified Leads from Ateco:&nbsp; <o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="1553" style="width:1165.0pt;margin-left:-.15pt;border-collapse:collapse">
<tbody>
<tr style="height:15.0pt">
<td width="167" nowrap="" valign="bottom" style="width:125.0pt;border:solid #8EA9DB 1.0pt;border-right:none;background:#4472C4;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Name<o:p></o:p></span></b></p>
</td>
<td width="111" nowrap="" valign="bottom" style="width:83.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;background
;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Mobile<o:p></o:p></span></b></p>
</td>
<td width="259" nowrap="" valign="bottom" style="width:194.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;backgroun
4;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Email<o:p></o:p></span></b></p>
</td>
<td width="103" nowrap="" valign="bottom" style="width:77.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;background
;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Postal Code<o:p></o:p></span></b></p>
</td>
<td width="109" nowrap="" valign="bottom" style="width:82.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;background
;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Enquiry Date<o:p></o:p></span></b></p>
</td>
<td width="239" nowrap="" valign="bottom" style="width:179.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;backgroun
4;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Lead Source<o:p></o:p></span></b></p>
</td>
<td width="261" nowrap="" valign="bottom" style="width:196.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;backgroun
4;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Dealer<o:p></o:p></span></b></p>
</td>
<td width="93" nowrap="" valign="bottom" style="width:70.0pt;border-top:solid #8EA9DB 1.0pt;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:none;background:
padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Date Sent
<o:p></o:p></span></b></p>
</td>
<td width="212" nowrap="" valign="bottom" style="width:159.0pt;border:solid #8EA9DB 1.0pt;border-left:none;background:#4472C4;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><b><span style="color:white">Preferred Model<o:p></o:p></span></b></p>
</td>
</tr>
<tr style="height:15.0pt">
<td width="167" nowrap="" valign="bottom" style="width:125.0pt;border-top:none;border-left:solid #8EA9DB 1.0pt;border-bottom:solid #8EA9DB 1.0pt;border-right:none;backgroun
2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">Test Justin<o:p></o:p></span></p>
</td>
<td width="111" nowrap="" valign="bottom" style="width:83.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">&#43;61 420 888 999<o:p></o:p></span></p>
</td>
<td width="259" nowrap="" valign="bottom" style="width:194.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">testmail@hotmail.com<o:p></o:p></span></p>
</td>
<td width="103" nowrap="" valign="bottom" style="width:77.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">4218<o:p></o:p></span></p>
</td>
<td width="109" nowrap="" valign="bottom" style="width:82.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">31-03-20<o:p></o:p></span></p>
</td>
<td width="239" nowrap="" valign="bottom" style="width:179.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">LDV Facebook - Book a Test Drive<o:p></o:p></span></p>
</td>
<td width="261" nowrap="" valign="bottom" style="width:196.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">QLD - Von Bibra Gold Coast - 554216<o:p></o:p></span></p>
</td>
<td width="93" nowrap="" valign="bottom" style="width:70.0pt;border:none;border-bottom:solid #8EA9DB 1.0pt;background:#D9E1F2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">03-04-20<o:p></o:p></span></p>
</td>
<td width="212" nowrap="" valign="bottom" style="width:159.0pt;border-top:none;border-left:none;border-bottom:solid #8EA9DB 1.0pt;border-right:solid #8EA9DB 1.0pt;backgroun
2;padding:0in 5.4pt 0in 5.4pt;height:15.0pt">
<p class="MsoNormal" align="center" style="text-align:center"><span style="color:black">T60 4WD Diesel Dual Cab Ute<o:p></o:p></span></p>
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal">Thank you,<o:p></o:p></p>
<p class="MsoNormal">Anna<o:p></o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<p class="MsoNormal"><b><span lang="EN-AU" style="color:black;mso-fareast-language:EN-AU">Anna Tupou</span></b><span lang="EN-AU" style="font-family:&quot;Helvetica&quot;,s
f;color:black;mso-fareast-language:EN-AU"><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-AU" style="color:black;mso-fareast-language:EN-AU">Call Centre Supervisor ÔÇô Lead Management</span><span lang="EN-AU" style="font-famil
Helvetica&quot;,sans-serif;color:black;mso-fareast-language:EN-AU"><o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-AU" style="font-size:10.5pt;color:black;mso-fareast-language:EN-AU"><br>
</span><b><span lang="EN-AU" style="font-family:&quot;Century Gothic&quot;,sans-serif;color:black;mso-fareast-language:EN-AU"><img width="294" height="34" style="width:3.06
ght:.3541in" id="_x0038_11B48E0-2644-4F0E-A8FF-F2DD7ECD462F" src="cid:image002.jpg@01D609B1.5BB77760" alt="cid:BD091752-D740-4B3A-B050-FF52A328E5C8"></span></b><b><span lan
" style="font-family:&quot;Century Gothic&quot;,sans-serif;color:black;mso-fareast-language:EN-AU"><o:p></o:p></span></b></p>
<p class="MsoNormal"><span lang="EN-AU" style="mso-fareast-language:EN-AU"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-AU" style="font-size:10.5pt;color:black;mso-fareast-language:EN-AU">2A Hill Rd Lidcombe NSW 2141 Australia</span><span lang="EN-AU" styl
size:10.5pt;color:#005CFB;mso-fareast-language:EN-AU">
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-AU" style="font-size:10.5pt;color:#005CFB;mso-fareast-language:EN-AU">P</span><span lang="EN-AU" style="font-size:10.5pt;color:black;mso
-language:EN-AU"> ÔÇé&#43;61 2 8577 8097ÔÇé</span><span lang="EN-AU" style="font-size:10.5pt;color:#005CFB;mso-fareast-language:EN-AU">|</span><span lang="EN-AU" style="fon
0.5pt;color:black;mso-fareast-language:EN-AU">ÔÇé</span><span lang="EN-AU" style="font-size:10.5pt;color:#005CFB;mso-fareast-language:EN-AU">
 E</span><span lang="EN-AU" style="font-size:10.5pt;color:black;mso-fareast-language:EN-AU">ÔÇé</span><u><span lang="EN-AU" style="font-size:10.5pt;color:blue;mso-fareast-l
EN-AU">atupou@ateco.com.au<o:p></o:p></span></u></p>
<p class="MsoNormal"><span lang="EN-AU" style="font-size:10.5pt;color:#005CFB;mso-fareast-language:EN-AU">M&nbsp;
</span><span lang="EN-AU" style="font-size:10.5pt;mso-fareast-language:EN-AU">0407 588 506<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:10.0pt"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p><b><span style="font-size:13.5pt;font-family:webdings;color:green">P</span> <span style="font-size: 7.5pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:gree
<i>: Please consider the environment before printing this e-mail. </span></i></b></p>
<p id="disclaimer-input" style="font-family: Helvetica,Arial,sans-serif; color: gray; font-size: 7.5pt;" class="txt">
IMPORTANT NOTICE: If this e-mail is received by other than the named addressee, please notify us immediately by telephone or return e-mail and delete all copies from your c
system. This document contains information proprietary to Ateco Group and its
 affiliates or third parties to which Ateco may have a legal obligation to protect such information from unauthorised disclosure, use or duplication. Any disclosure, use or
tion of this document or the information contained herein for other than the
 specific purpose for which it was disclosed by Ateco is expressly prohibited. It is the recipient's responsibility to check this message and attachments for viruses.
</p>
<div></div>
</div>
</body>
</html>

<br>
<p> </p>

<p align="center" style="text-align:center">**********Disclaimer**********</p>

<p style="text-align:justify">&quot;This email and any
attachments are confidential and are for the intended addressee[s] only.
Unauthorised use of this communication is prohibited. If you have received this
communication in error, please notify the sender and remove them from your
system. Confidentiality is not waived or lost by reason of the mistaken
delivery to you. Please scan this email and any attachment(s) for viruses. It
is your responsibility to check them before opening&quot; </p>

<p align="center" style="text-align:center">********End of
Disclaimer*********</p>
Non-Contacted/Non-Qualified Leads from Ateco:

Name
Mobile
Email
Postal Code
Enquiry Date
Lead Source
Dealer
Date Sent
Preferred Model
Test Justin
+61 420 888 999
testmail@hotmail.com
4218
31-03-20
LDV Facebook - Book a Test Drive
QLD - Von Bibra Gold Coast - 554216
03-04-20
T60 4WD Diesel Dual Cab Ute

Thank you,
Anna


Anna Tupou
Call Centre Supervisor ÔÇô Lead Management


2A Hill Rd Lidcombe NSW 2141 Australia
P ÔÇé+61 2 8577 8097ÔÇé|ÔÇé EÔÇéatupou@ateco.com.au
M 0407 588 506


P : Please consider the environment before printing this e-mail.
IMPORTANT NOTICE: If this e-mail is received by other than the named addressee, please notify us immediately by telephone or return e-mail and delete all copies from your computer
system. This document contains information proprietary to Ateco Group and its affiliates or third parties to which Ateco may have a legal obligation to protect such information fro
m unauthorised disclosure, use or duplication. Any disclosure, use or duplication of this document or the information contained herein for other than the specific purpose for which
 it was disclosed by Ateco is expressly prohibited. It is the recipient's responsibility to check this message and attachments for viruses.


**********Disclaimer**********
"This email and any attachments are confidential and are for the intended addressee[s] only. Unauthorised use of this communication is prohibited. If you have received this communi
cation in error, please notify the sender and remove them from your system. Confidentiality is not waived or lost by reason of the mistaken delivery to you. Please scan this email
and any attachment(s) for viruses. It is your responsibility to check them before opening"
********End of Disclaimer*********

注意:我需要为例如邮政编码为4218的邮政编码制作精确的键值对

您可以使用任何DOM解析器库来解析HTML。您可以简单地为任何HTML标记提供id,然后就可以获得该元素。我建议你去图书馆

使用Jsoup库使用以下代码

在HTML中添加ID

<p id="POSTAL_CODE">4218</p>
您还可以对HTML使用属性提取,有关使用Jsoup进行属性提取的更多信息,请访问页面

有关更多信息,您可以参考本文,在本文中,他们提到了用于多种编程语言的多个HTML解析库

准确问题的代码

注意:只有当您在每行(包括标题)中有完全相同数量的p标记时,此代码才有效

Document doc = Jsoup.parse(htmlString);

List<String> keys = new ArrayList<>();
List<Map<String, String>> dataPairs = new ArrayList<>();

Elements trElements = doc.getElementsByTag("tr");

    for (int i = 0; i < trElements.size(); i++) {
    Element element = trElements.get(i);
    Elements pElements = element.getElementsByTag("p");

    Map<String, String> map = new HashMap<>();
    for (int i1 = 0; i1 < pElements.size(); i1++) {
        Element p = pElements.get(i1);
        if (i == 0) {
            keys.add(p.text());
        } else {
            map.put(keys.get(i1), p.text());
        }
    }
    dataPairs.add(map);
}
documentdoc=Jsoup.parse(htmlString);
列表键=新的ArrayList();
List dataPairs=new ArrayList();
Elements treelements=doc.getElementsByTag(“tr”);
对于(int i=0;i
请显示原始邮件正文。预处理会使其结构松散,所以避免这种情况。嗨@Nilesh,谢谢你的想法。HTML表单来自电子邮件正文(即客户端),我无法为每个键提供id。@Justin,在这种情况下,您必须使用属性解析或其他方法进行提取,您可以探索这里提到的一些数据提取技术Hi@Nilesh,你能给我举一个例子,通过使用上面的HTML表单数据(例如,从我的问题中引用“Email Body-HTML Form Content:”来提取至少一个元素吗?这对我帮助很大。考虑到你将在标记中获得所有文本数据,我没有测试这段代码,但它应该可以工作。Document doc=Jsoup.parse(htmlString);Elements pElements=doc.getElementsByTag(“p”);StringBuilder text=new StringBuilder();对于(Element p:pElements){text.append(p.text());text.append(“\n”);}Hi@Nilesh,上述jsoup解析返回的内容与我讨论中共享的内容相同(请参阅:转换后的字符串内容)(邮件正文))。我希望能像我提到的那样,从上面的邮件内容中用精确的值映射密钥。(例如,邮政编码:4218,姓名:test Justin等)。
Document doc = Jsoup.parse(htmlString);

List<String> keys = new ArrayList<>();
List<Map<String, String>> dataPairs = new ArrayList<>();

Elements trElements = doc.getElementsByTag("tr");

    for (int i = 0; i < trElements.size(); i++) {
    Element element = trElements.get(i);
    Elements pElements = element.getElementsByTag("p");

    Map<String, String> map = new HashMap<>();
    for (int i1 = 0; i1 < pElements.size(); i1++) {
        Element p = pElements.get(i1);
        if (i == 0) {
            keys.add(p.text());
        } else {
            map.put(keys.get(i1), p.text());
        }
    }
    dataPairs.add(map);
}