KNIME XPath node: selecting multiple tags

I am trying to extract XML from HTML source code. The source looks like this:

.
.
.
<h5>
 <u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
 linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
 linktext2
</a>
</d>
</li>
</ul>
<h5>
 <u>B</u>
</h5>
<ul class="listss">
 .\
 .(SAME TAGS AS ABOVE)
 ./
</ul>
<h5>
 <u>C</u>
</h5>
<ul class="listss">
 .\
 .(SAME TAGS AS ABOVE)
 ./
</ul>
<h5>
 <u>D</u>
</h5>
<ul class="listss">
 .\
 .(SAME TAGS AS ABOVE)
 ./
</ul>
<h5>
    <u>A</u>
</h5>
<ul class="listss">
 <li>
  <d>
   <a href="link">
    linktext
   </a>
  </d>
 </li>
 <li>
  <d>
   <a href="link2">
    linktext2
   </a>
  </d>
 </li>
</ul>

Any ideas? Thanks.

If you are using Python, you can do it like this:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''<h5>
  <u>A</u>
</h5>
<ul class="listss">
  <li>
    <d>
      <a href="link">
        linktext
      </a>
    </d>
  </li>
  <li>
    <d>
      <a href="link2">
        linktext2
      </a>
    </d>
  </li>
</ul>
<h5>
  <u>B</u>
</h5>
<ul class="listss">
  .\
  .(SAME TAGS AS ABOVE)
  ./
</ul>
<h5>
  <u>C</u>
</h5>
<ul class="listss">
  .\
  .(SAME TAGS AS ABOVE)
  ./
</ul>
<h5>
  <u>D</u>
</h5>
<ul class="listss">
  .\
  .(SAME TAGS AS ABOVE)
  ./
</ul>'''
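doc = SimplifiedDoc(html)
items = doc.children  # top-level elements, in document order
lastName = None
for item in items:
  if item.tag == 'h5':
    # Remember the most recent heading (A, B, C, ...)
    lastName = item.text
  else:
    # A <ul> block: collect its <a> elements under that heading
    links = item.getElementsByTag('a')
    print(lastName, links)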

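Please add the desired output to the question. Your HTML has two h5/ul pairs; what is the difference between the first and the second?

The first h5 tag holds the year (A, B, C, D in this example), and below each year is a list of links. I just want to group each year with its links: A with its links, B with its links, and so on. If that is confusing, I can change the ul tag under the second h5 tag.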
Output of the Python script:

A [{'href': 'link', 'tag': 'a', 'html': 'linktext\n      '}, {'href': 'link2', 'tag': 'a', 'html': 'linktext2\n      '}]
B []
C []
D []
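
For readers without simplified_scrapy, the same heading-to-links grouping can be sketched with Python's standard library alone. This is only an illustration, not the answer's method. It assumes the fragment is well-formed XML (real-world HTML often is not), and the name `group_links` and the links under heading B are hypothetical, made up for the example. In XPath 1.0 the same pairing is usually written with the `following-sibling` axis, for example `//h5[normalize-space(u)='A']/following-sibling::ul[1]//a`. Whether the KNIME XPath node accepts that expression unchanged has not been tested here.

```python
import xml.etree.ElementTree as ET

# Sample mirroring the question's markup. Section B is filled in with
# hypothetical links purely for illustration.
SAMPLE = """
<h5><u>A</u></h5>
<ul class="listss">
  <li><d><a href="link"> linktext </a></d></li>
  <li><d><a href="link2"> linktext2 </a></d></li>
</ul>
<h5><u>B</u></h5>
<ul class="listss">
  <li><d><a href="link3"> linktext3 </a></d></li>
</ul>
"""


def group_links(fragment):
    """Map each <h5> heading to the (href, text) pairs of the <ul> after it."""
    # Wrap the fragment in a synthetic root so it parses as one XML document.
    root = ET.fromstring("<root>" + fragment + "</root>")
    groups = {}
    current = None
    for child in root:  # direct children, in document order
        if child.tag == "h5":
            # Heading text lives in the nested <u>; gather all text below <h5>.
            current = "".join(child.itertext()).strip()
            groups[current] = []
        elif child.tag == "ul" and current is not None:
            for a in child.iter("a"):
                groups[current].append((a.get("href"), (a.text or "").strip()))
    return groups


groups = group_links(SAMPLE)
for heading, links in groups.items():
    print(heading, links)
```

Each `ul` is attached to the most recent `h5` seen, so the approach relies on headings and lists alternating as direct siblings, as they do in the question's markup.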