KNIME XPath node: selecting multiple tags
I am trying to extract XML from an HTML source. The source looks like this:
.
.
.
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
Any ideas? Thanks.
If you are using Python, you can do it like this:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>'''
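doc = SimplifiedDoc(html)
items = doc.children  # top-level elements: alternating h5 and ul
lastName = None  # text of the most recent h5 heading
for item in items:
    if item.tag == 'h5':
        lastName = item.text
    else:
        # collect every <a> inside the ul that follows the heading
        links = item.getElementsByTag('a')
        print(lastName, links)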
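Please add the desired output to the question. Your HTML has two h5/ul pairs; what is the difference between the first and the second?
Each h5 tag holds a year (A, B, C, D in this example), and below each year there is a list of links. I just want to group each year with its links: A with its links, B with its links, and so on. If it is confusing, I can change the ul tag under the second h5 tag.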
Output:
A [{'href': 'link', 'tag': 'a', 'html': 'linktext\n '}, {'href': 'link2', 'tag': 'a', 'html': 'linktext2\n '}]
B []
C []
D []
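For readers without simplified_scrapy, the same heading-to-links grouping can be sketched with Python's standard library alone. This is only an illustration under a few assumptions: the fragment is well-formed enough to parse as XML, it is wrapped in a hypothetical `<root>` element, and the `link3`/`linktext3` entry under B is invented here to stand in for the elided "SAME TAGS AS ABOVE" content. The function name `group_links_by_heading` is likewise made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample. The <root> wrapper and the B entry
# (link3 / linktext3) are assumptions added for illustration only.
SAMPLE = """<root>
<h5><u>A</u></h5>
<ul class="listss">
  <li><d><a href="link"> linktext </a></d></li>
  <li><d><a href="link2"> linktext2 </a></d></li>
</ul>
<h5><u>B</u></h5>
<ul class="listss">
  <li><d><a href="link3"> linktext3 </a></d></li>
</ul>
</root>"""


def group_links_by_heading(xml_text):
    """Return {heading text: [(href, link text), ...]} in document order."""
    root = ET.fromstring(xml_text)
    groups = {}
    current = None  # text of the most recent h5 heading
    for child in root:
        if child.tag == "h5":
            # itertext() also picks up the text of the nested <u>
            current = "".join(child.itertext()).strip()
            groups.setdefault(current, [])
        elif child.tag == "ul" and current is not None:
            # every <a> below this ul belongs to the preceding heading
            for a in child.findall(".//a"):
                groups[current].append((a.get("href"), (a.text or "").strip()))
    return groups


if __name__ == "__main__":
    for heading, links in group_links_by_heading(SAMPLE).items():
        print(heading, links)
```

The loop relies on each ul directly following its h5 as a sibling, which matches the sample. In plain XPath 1.0 the same pairing is usually written relative to each h5 as `following-sibling::ul[1]//a`; whether that fits a particular KNIME XPath node configuration is not something this sketch verifies.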