Python: paths to strings in HTML
How can I generate the paths to all text strings in an HTML document, preferably using BeautifulSoup? For example, I have this code:
<DIV class="art-info"><SPAN class="time"><SPAN class="time-date" content="2012-02-28T14:46CET" itemprop="datePublished">
28. february 2012
</SPAN>
14:46
</SPAN></DIV><DIV>
Something,<P>something else</P>continuing.
</DIV>
I have gone through the BeautifulSoup documentation, but I can't figure out how to do it. Do you have any ideas?
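import re
from bs4 import BeautifulSoup

# Assumes lxml is installed; it wraps the fragment in <html><body>, as in the output below.
with open("input") as f:
    soup = BeautifulSoup(f, "lxml")

# Visit every text node that contains at least one non-whitespace character.
for t in soup.find_all(string=re.compile(r"\S")):
    # .parents walks from the nearest tag up to the root, so reverse it.
    path = "/".join(reversed([p.name for p in t.parents]))
    print(path + "/" + t.strip())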
Output:
[document]/html/body/div/span/span/28. february 2012
[document]/html/body/div/span/14:46
[document]/html/body/div/Something,
[document]/html/body/div/p/something else
[document]/html/body/div/continuing.
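The `[document]/html/body` prefix in this output comes from the parser rather than from the input, which contains no `<html>` or `<body>` tag. Below is a minimal self-contained sketch, assuming Python 3 and the standard-library `html.parser` backend, which does not add those wrappers. The helper name `text_paths` and the inline `SAMPLE` string are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the question's HTML (illustrative sample).
SAMPLE = (
    '<DIV class="art-info"><SPAN class="time">'
    '<SPAN class="time-date">\n28. february 2012\n</SPAN>\n14:46\n</SPAN></DIV>'
    '<DIV>\nSomething,<P>something else</P>continuing.\n</DIV>'
)

def text_paths(html, parser="html.parser"):
    """Return 'tag/tag/.../text' for every non-blank text node (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    paths = []
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only strings between tags
        # .parents runs from the nearest tag up to the root BeautifulSoup object
        names = [p.name for p in node.parents]
        paths.append("/".join(reversed(names)) + "/" + text)
    return paths

for line in text_paths(SAMPLE):
    print(line)
```

With `html.parser` the paths start at `[document]/div/...`. Passing a parser such as `"lxml"` or `"html5lib"` (if installed) would insert the `html/body` levels seen above.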
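I'm not sure I understand what you are trying to do. Could you elaborate a bit?
I want to generate the paths to all text strings in an HTML document. Put simply, I would like to get something like //html/body/div/div/span/"string" for the first non-markup text found, then for example html/body/div/div/span/h3/p/"text string" for the second non-markup text, and so on.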
str1 >>> //div/span/span/28. february
str2 >>> //div/span/14:46
str3 >>> //div/Something,continuing.
str4 >>> //div/p/something else
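One way to get close to this format is to group text nodes by their parent element, so that `Something,` and `continuing.` end up on a single line. This is a hedged sketch, assuming Python 3 and `html.parser`. The name `grouped_text_paths` is hypothetical, and joining sibling strings with no separator is an assumption read off the `str3` line above.

```python
from bs4 import BeautifulSoup

def grouped_text_paths(html):
    """Return one '//tag/.../text' line per element that directly contains text."""
    soup = BeautifulSoup(html, "html.parser")
    # Keyed by id(parent): bs4 tags compare equal by content, so identity is safer.
    # Dicts keep insertion order, so groups appear in order of first text node.
    groups = {}
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # ignore whitespace-only strings
        key = id(node.parent)
        if key not in groups:
            # Tag names from the root down, leaving out the synthetic "[document]" root
            names = [p.name for p in reversed(list(node.parents))
                     if p.parent is not None]
            groups[key] = (names, [])
        groups[key][1].append(text)
    return ["//" + "/".join(names + ["".join(pieces)])
            for names, pieces in groups.values()]

html = ('<div><span><span>\n28. february 2012\n</span>\n14:46\n</span></div>'
        '<div>\nSomething,<p>something else</p>continuing.\n</div>')
for line in grouped_text_paths(html):
    print(line)
```

Grouping by first occurrence keeps the lines in the order the text appears in the document, which is why the `div` line comes before the `div/p` line.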