Building text from HTML using XPath
I receive HTML like the example below from a server. I rebuild the text part by running the XPath expression @"//text()" and appending each "nodeContent" value to a string. The code looks like this:
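// Start at index 2 to skip the first two text nodes (the title and the inline CSS).
for (NSUInteger i = 2; i < [resultXPathQuery count]; i++) {
    // Append the text of each node, one node per line.
    [mytext appendString:[[resultXPathQuery objectAtIndex:i] objectForKey:@"nodeContent"]];
    [mytext appendString:@"\n"];
}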
How can I build the text part while taking the empty nodes into account?
I would like to obtain:
Line 1
line 2
line 3
line 4
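EDIT
Actually, I noticed that the HTML can be more "complex", so selecting all span elements or all p elements is not enough. In addition, several span elements can appear inside the same p element, and in that case I must not start a new line in the string.
This is the body of a more complex HTML response: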
<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c5 c10"><span>ghhgfhgfh hghg hgkfhjgk ghjgkh ghjgjhg gjhjg gjhj gjhgjhgjhg gfhjkgjg jghjgfhjgf fghfj jghfj fghjggf jhgjgjgkjg</span></p>
<p class="c1 c10"><span></span></p>
<p class="c4"><span>gfgfgfd</span></p>
<p class="c4"><span>f</span></p>
<p class="c4">
<span>gfdgfdg</span>
<span class="c7">hg</span></p>
<p class="c4"><span class="c7">ghgfhgfh</span></p>
<p class="c4"><span class="c7">gfhgfhgf</span></p>
<p class="c5">
<span class="c7">hgfh </span>
<span class="c0">gfdgfg</span></p>
<p class="c5"><span class="c0">fgfdgfdgfd</span></p>
<p class="c5"><span class="c0">gdfgdfgfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="http://www.google.com">www.google.com</a></span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">fgfdgfdg</span></p>
<p class="c5">
<span class="c0">fgffgfdgfg</span>
<span class="c0 c11">gfgfdgfd fgd fd</span>
<span class="c0">fdgfdg</span></p>
<p class="c5"><span class="c0">fgfdgfdgf</span></p>
<p class="c5"><span class="c0">gfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="mailto:….">...</a></span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gdfgfd</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c5"><span class="c0">gfhgfh</span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c2" start="1">
<li class="c3"><span class="c0">gfhg</span></li>
<li class="c3"><span class="c0">hgfh</span></li>
<li class="c3"><span class="c0">hgf</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
</body>
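I need an XPath expression that selects the p, h1, h2, ..., h6 and li elements and takes their inner text into account, so that new lines and empty lines are detected correctly.
For the example above, you can use
//span
to return all <span> elements, regardless of their content. It looks like you are doing some other filtering, since //text() should also return the CSS block and the title from <head>.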
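To illustrate why selecting the span elements keeps the empty lines that //text() drops, here is a minimal sketch. It is written in Python with the standard library's xml.etree.ElementTree rather than in Objective-C, purely so that it runs without a third-party XPath wrapper. SAMPLE is a trimmed copy of the first document's body and span_lines is a hypothetical helper name. ElementTree only accepts well-formed markup, which this sample happens to be; real-world HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first sample's body (assumed to be well-formed XML).
SAMPLE = """<body class="c6">
<p class="c1"><span class="c2">A title</span></p>
<p class="c1 c4"><span class="c2"></span></p>
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
</body>"""


def span_lines(markup):
    """Return one string per <span>, using '' for spans without text."""
    root = ET.fromstring(markup)
    # ".//span" is ElementTree's spelling of the XPath //span.
    return ["".join(span.itertext()) for span in root.findall(".//span")]


lines = span_lines(SAMPLE)
print("\n".join(lines))
```

Each empty span contributes an empty string, so joining with "\n" reproduces the blank lines. In the Objective-C loop, whatever the wrapper returns for an empty node (an empty string or nil) would have to be mapped to an empty line in the same way.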
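I would prefer to use regular expressions:
1. Grab everything between the body tags (this can also be done with XPath)
2. Replace each </p> with \n
3. Strip the tags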
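The three steps above can be sketched as follows, in Python with only the standard library. This is an illustration under assumptions, not a definitive implementation: which closing tags count as line ends is an assumption here (</p>, plus </h1> to </h6> and </li>, taken from the question's list of elements), and body_lines, LINE_END and SAMPLE are hypothetical names. Regular expressions are fragile on arbitrary HTML, so this only suits markup as regular as the samples in the question.

```python
import re
from html import unescape

# Assumed set of closing tags that end a line of text.
LINE_END = re.compile(r"</(?:p|h[1-6]|li)\s*>", re.IGNORECASE)


def body_lines(html):
    """Return the body text as a list of lines, keeping empty paragraphs."""
    # Step 1: keep only what sits between <body ...> and </body>.
    match = re.search(r"<body[^>]*>(.*?)</body\s*>", html,
                      re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else html
    # Collapse whitespace so line breaks in the markup do not leak through.
    body = re.sub(r"\s+", " ", body)
    # Step 2: turn each line-ending closing tag into a newline.
    body = LINE_END.sub("\n", body)
    # Step 3: strip every remaining tag.
    body = re.sub(r"<[^>]+>", "", body)
    parts = body.split("\n")
    # Whatever follows the last closing tag is not a line of its own.
    if parts and parts[-1].strip() == "":
        parts.pop()
    return [unescape(part.strip()) for part in parts]


SAMPLE = """<html><head><title>A title</title><style type="text/css">p{margin:0}</style></head>
<body class="c6">
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1">
<span>gfdgfdg</span>
<span class="c7">hg</span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
</ol>
</body></html>"""

print("\n".join(body_lines(SAMPLE)))
```

Because the title and the style block live in the head, step 1 removes them without any index arithmetic, and the empty paragraph survives as an empty string in the result.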
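Yes, in the for statement the variable i starts at 2, so I skip the inline CSS and the title. I still have to run some more tests to check whether the server (GDocs) always uses this layout. In that case, how do I use //span together with //text()? Also, do you know how to get span elements that do not contain other elements, for example all span elements that do not contain an img element?
@Objnewbie - @"nodeContent" should return the text value of the node (in this case, the text value of the <span>). When there is no text value, you will have to experiment to find out whether it is an empty string or nil. To get only the spans that do not contain an img element, use something like //span[not(img)].
Thanks. I noticed that the returned HTML can be more "complex" than the first one above, as you can see in the second example.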
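For the more complex case, one possible approach is to walk the block-level elements (p, h1 to h6, li) instead of the spans, and to join all the text inside each one, so that several spans in the same paragraph stay on one line. The sketch below is in Python with the standard library's xml.etree.ElementTree and rests on several assumptions: the markup is well-formed, block elements are not nested inside each other, and the names BLOCK_TAGS, block_lines, spans_without_img and SAMPLE are hypothetical. The last paragraph of SAMPLE, with an img inside a span, is invented for illustration. ElementTree does not support the not() function, so the //span[not(img)] filter is spelled out as a list comprehension.

```python
import xml.etree.ElementTree as ET

# Elements that start a new line of text (taken from the question).
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}

# Excerpt of the second sample; the final <p> with an <img> is hypothetical.
SAMPLE = """<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c4">
<span>gfdgfdg</span>
<span class="c7">hg</span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
</ol>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
<p class="c5"><span><img src="x.png"/></span></p>
</body>"""


def block_lines(root):
    """Return one line per block element, '' when it holds no text."""
    lines = []
    for element in root.iter():  # iter() walks the tree in document order
        if element.tag in BLOCK_TAGS:
            text = "".join(element.itertext())
            # Collapse whitespace, including line breaks between spans.
            lines.append(" ".join(text.split()))
    return lines


def spans_without_img(root):
    """Same selection as //span[not(img)]: spans with no <img> child."""
    return [span for span in root.iter("span") if span.find("img") is None]


root = ET.fromstring(SAMPLE)
print("\n".join(block_lines(root)))
print(len(spans_without_img(root)))
```

If block elements can nest (for example a p inside an li), the inner text would be counted twice, so the walk would need to skip blocks that have a block ancestor.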
<html><head><title>A title</title><style type="text/css">
ol{margin:0;padding:0}p{margin:0}
.c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}
.c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}
.c7{color:#aaaaaa;font-family:Times New Roman}
.c3{color:#0000ee;text-decoration:underline}
.c5{color:inherit;text-decoration:inherit}
.c2{font-size:12pt;font-family:Times New Roman}
.c4{height:12pt}.c1{direction:ltr}
body{color:#000000;font-size:12pt;font-family:Times New Roman}
h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font- family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}
h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style>
</head>
<body class="c6">
<p class="c1"><span class="c2">A title</span></p>
<p class="c1 c4"><span class="c2"></span></p>
<p class="c4 c1"><span class="c2"></span></p>
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c3 c2"><span class="c1"></span></p>
<p class="c1"><span class="c7">line 4</span></p>
</body></html>