Building text from HTML using XPath

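I receive HTML like the one below from a server. I rebuild the text part by running the XPath expression @"//text()" and appending each "nodeContent" value to a string. The code looks like this: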

for (int i=2; i<[resultXPathQuery count]; i++) {
    [mytext appendString:[[resultXPathQuery objectAtIndex:i] objectForKey:@"nodeContent"]];
    [mytext appendString:@"\n"];
}
How can I build the text part so that empty nodes are taken into account?
I would like to obtain:

Line 1
line 2

line 3



line 4

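<html><head><title>A title</title><style type="text/css">
ol{margin:0;padding:0}p{margin:0}
.c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}
.c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}
.c7{color:#aaaaaa;font-family:Times New Roman}
.c3{color:#0000ee;text-decoration:underline}
.c5{color:inherit;text-decoration:inherit}
.c2{font-size:12pt;font-family:Times New Roman}
.c4{height:12pt}.c1{direction:ltr}
body{color:#000000;font-size:12pt;font-family:Times New Roman}
h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}
h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style>
</head>
<body class="c6">
<p class="c1"><span class="c2">A title</span></p>
<p class="c1 c4"><span class="c2"></span></p>
<p class="c4 c1"><span class="c2"></span></p>
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c3 c2"><span class="c1"></span></p>
<p class="c1"><span class="c7">line 4</span></p>
</body></html>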

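EDIT

Actually, I noticed that the HTML can be more "complex", so selecting all the span elements or all the p elements is not enough. Moreover, more than one span element can appear inside the same p element, and in that case I must not start a new line in the string.

This is the body of a more complex HTML response: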

<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c5 c10"><span>ghhgfhgfh hghg hgkfhjgk ghjgkh ghjgjhg gjhjg gjhj gjhgjhgjhg gfhjkgjg jghjgfhjgf fghfj jghfj fghjggf jhgjgjgkjg</span></p>
<p class="c1 c10"><span></span></p>
<p class="c4"><span>gfgfgfd</span></p>
<p class="c4"><span>f</span></p>
<p class="c4">
     <span>gfdgfdg</span>
     <span class="c7">hg</span></p>
<p class="c4"><span class="c7">ghgfhgfh</span></p>
<p class="c4"><span class="c7">gfhgfhgf</span></p>
<p class="c5">
     <span class="c7">hgfh </span>
     <span class="c0">gfdgfg</span></p>
<p class="c5"><span class="c0">fgfdgfdgfd</span></p>
<p class="c5"><span class="c0">gdfgdfgfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="http://www.google.com">www.google.com</a></span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">fgfdgfdg</span></p>
<p class="c5">
     <span class="c0">fgffgfdgfg</span>
     <span class="c0 c11">gfgfdgfd fgd fd</span>
     <span class="c0">fdgfdg</span></p>
<p class="c5"><span class="c0">fgfdgfdgf</span></p>
<p class="c5"><span class="c0">gfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="mailto:….">...</a></span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gdfgfd</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c5"><span class="c0">gfhgfh</span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c2" start="1">
<li class="c3"><span class="c0">gfhg</span></li>
<li class="c3"><span class="c0">hgfh</span></li>
<li class="c3"><span class="c0">hgf</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
</body>

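I need an XPath expression that selects the p, h1, h2, ..., h6 and li elements and takes their inner text into account, so that new lines and empty lines are detected correctly.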
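To make the requirement concrete, here is a minimal sketch of the intended behaviour, written in Python with the standard library's xml.etree.ElementTree rather than the Objective-C wrapper used in the question. It assumes the markup is well-formed XML and that block elements are not nested inside each other; SAMPLE (a shortened copy of the body above), BLOCK_TAGS and block_lines are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a shortened copy of the more complex body shown above.
SAMPLE = """<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c4">
     <span>gfdgfdg</span>
     <span class="c7">hg</span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
</ol>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
</body>"""

# Elements that should each produce exactly one output line.
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


def block_lines(markup):
    """Return one string per block element, in document order."""
    root = ET.fromstring(markup)
    lines = []
    for element in root.iter():
        if element.tag in BLOCK_TAGS:
            # Concatenate every descendant text node, then collapse whitespace,
            # so several spans inside one block end up on the same line and an
            # empty block yields an empty string (a blank line).
            text = " ".join("".join(element.itertext()).split())
            lines.append(text)
    return lines


print("\n".join(block_lines(SAMPLE)))
```

Working per block element rather than per text node is what keeps the empty paragraphs: a text-node query has nothing to return for an empty span, while its enclosing p is still visited.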

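For the example above you can use //span, which returns all <span> elements regardless of their content. It looks like you are doing some other filtering, because //text() should also return the CSS block from <style> and the title from <title> first.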
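A rough illustration of the //span idea, sketched in Python with xml.etree.ElementTree instead of the Objective-C code from the question. SAMPLE is a hypothetical, shortened copy of the first body, span_lines is an illustrative helper, and the sketch assumes well-formed markup with exactly one span per line of text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a few paragraphs taken from the first HTML body.
SAMPLE = """<body class="c6">
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
</body>"""


def span_lines(markup):
    """Return the text of every span, using "" for spans without text."""
    root = ET.fromstring(markup)
    # ".//span" is ElementTree's spelling of the XPath "//span".
    # An empty span has text None here, so it is mapped to an empty string.
    return [span.text or "" for span in root.findall(".//span")]


text = "\n".join(span_lines(SAMPLE))
print(text)
```

Because every span is returned, including the empty ones, the blank lines survive the join.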

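I would rather use regular expressions:

  • Grab everything between the body tags (this could also be done with XPath)
  • Replace each paragraph tag (presumably the closing </p>) with \n
  • Strip the tags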
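The three steps above could be sketched as follows, in Python with the standard re and html modules. Treating the closing </p> as the tag to replace is an assumption, as are the names SAMPLE and body_text; regular expressions are also only adequate for simple, predictable markup like this sample.

```python
import html
import re

# Hypothetical sample: a shortened copy of the first HTML document.
SAMPLE = """<html><head><title>A title</title></head>
<body class="c6">
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
</body></html>"""


def body_text(markup):
    """Rebuild plain text: one line per paragraph, blank for empty ones."""
    # Step 1: grab everything between the body tags.
    match = re.search(r"<body[^>]*>(.*?)</body>", markup, flags=re.S | re.I)
    body = match.group(1) if match else markup
    # Collapse the source's own whitespace so only </p> starts a new line.
    body = re.sub(r"\s+", " ", body)
    # Step 2 (assumed tag): replace each closing paragraph tag with a newline.
    body = re.sub(r"</p\s*>", "\n", body, flags=re.I)
    # Step 3: strip all remaining tags.
    body = re.sub(r"<[^>]+>", "", body)
    lines = [html.unescape(line).strip() for line in body.split("\n")]
    # The text after the final </p> is empty; drop that trailing entry.
    if lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


print(body_text(SAMPLE))
```

Taking only the body avoids the title and the CSS block, which is what starting the loop at index 2 did in the original code.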

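  • Yes, in the for statement the variable i starts from 2, so I skip the inline CSS and the title. I still have to run some more tests to check whether the server (GDocs) always uses this form. In that case, how can I use //span together with //text()? Also, do you know how to get the span elements that do not contain other elements, for example all span elements that do not contain an img element?
  • @Objnewbie - @"nodeContent" should return the text value of the node (in this case, the text value of the <span>). When there is no text value, you will have to experiment to determine whether it is an empty string or nil. To get only the spans that do not contain an img element, use something like //span[not(img)].
  • Thanks. I noticed that the returned HTML can be more "complex" than the (first) one above, as you can see in the second example.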
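The //span[not(img)] selection can be tried out with a small Python sketch. ElementTree's limited XPath support has no not() function, so the predicate is emulated with a plain filter; SAMPLE and spans_without_img are hypothetical names, and the markup is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one span with text, one wrapping an image, one empty.
SAMPLE = """<body>
<p><span>plain text</span></p>
<p><span><img src="picture.png"/></span></p>
<p><span></span></p>
</body>"""


def spans_without_img(markup):
    """Emulate the XPath //span[not(img)]: spans that have no img child."""
    root = ET.fromstring(markup)
    return [span for span in root.iter("span") if span.find("img") is None]


# In ElementTree an empty span has text None rather than "", which is the
# same kind of empty-string-versus-nil check mentioned in the comment above.
values = [span.text or "" for span in spans_without_img(SAMPLE)]
print(values)
```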
    <html><head><title>A title</title><style type="text/css">
    ol{margin:0;padding:0}p{margin:0}
    .c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}
    .c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}
    .c7{color:#aaaaaa;font-family:Times New Roman}
    .c3{color:#0000ee;text-decoration:underline}
    .c5{color:inherit;text-decoration:inherit}
    .c2{font-size:12pt;font-family:Times New Roman}
    .c4{height:12pt}.c1{direction:ltr}
    body{color:#000000;font-size:12pt;font-family:Times New Roman}
    h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font-  family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
    h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}
    h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
    h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
    h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
    h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style>
    </head>
    <body class="c6">
    <p class="c1"><span class="c2">A title</span></p>
    <p class="c1 c4"><span class="c2"></span></p>
    <p class="c4 c1"><span class="c2"></span></p>
    <p class="c1"><span class="c7">Line 1</span></p>
    <p class="c1"><span class="c7">line 2</span></p>
    <p class="c4 c1"><span class="c7"></span></p>
    <p class="c1"><span class="c7">line 3</span></p>
    <p class="c4 c1"><span class="c7"></span></p>
    <p class="c4 c1"><span class="c7"></span></p>
    <p class="c3 c2"><span class="c1"></span></p>
    <p class="c1"><span class="c7">line 4</span></p>
    </body></html>
    