Python BS HTML parsing - &amp; is ignored when printing URL string
Consider the following example:
htmlist = ['<div class="portal" role="navigation" id="p-coll-print_export">',\
'<h3>Print/export</h3>',\
'<div class="body">',\
'<ul>',\
'<li id="coll-create_a_book"><a href="/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Main+Page">Create a book</a></li>',\
'<li id="coll-download-as-rl"><a href="/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Main+Page&oldid=560327612&writer=rl">Download as PDF</a></li>',\
'<li id="t-print"><a href="/w/index.php?title=Main_Page&printable=yes" title="Printable version of this page [p]" accesskey="p">Printable version</a></li>',\
'</ul>',\
'</div>',\
'</div>',\
]
from bs4 import BeautifulSoup

soup = BeautifulSoup("".join(htmlist), "html.parser")
for x in soup("a"):
    print(x)
    print(x.attrs)
    print(soup.a.get_text())
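I see the following problems with this output:
- print(soup.a.get_text()) always prints the text of the first <a> tag.
- In the dictionary printed by print(x.attrs), the value of the "href" key is missing &amp; and contains a bare & instead.

Solution: HTML-encode the & characters in the href value and read the text from x rather than from soup.a: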
import html

for x in soup("a"):
    print(x)
    # Re-escape only the href value; quote=False leaves quotation marks untouched.
    print({k: html.escape(v, False) if k == 'href' else v for k, v in x.attrs.items()})
    print(x.get_text())
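Why not use x.get_text()? Also, &amp; is the HTML-encoded version of &, so I wouldn't worry about it.

@t.m.adam Of course I should get the text from x, thanks. I still need the &amp; part, though. This is part of a challenge, and I need the output to match.

@t.m.adam Quick question. As you can see, I added a solution that replaces & with &amp;, but I just realized this might be incorrect, because a link may contain legitimate ampersands. My question is:

That's unlikely. A URL that contains an ampersand for any reason other than separating parameters in the query string is most likely a malformed URL.

@t.m.adam Oops, my comment got cut off; glad you understood my question. Thanks.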
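
To illustrate the point in the comment above about ampersands in a well-formed URL, here is a minimal sketch using only the standard library. The parameter names and values are hypothetical and are not taken from the question. It assumes the query string is built with urllib.parse.urlencode, which percent-encodes a literal ampersand inside a value, so html.escape only ever touches the separators.

```python
import html
from urllib.parse import urlencode

# Hypothetical parameters: the title value contains a literal ampersand.
params = {"title": "Q&A", "printable": "yes"}

# urlencode percent-encodes the ampersand inside the value as %26,
# so the only bare "&" left is the parameter separator.
query = urlencode(params)

# Escape for HTML output; quote=False leaves quotation marks untouched.
escaped = html.escape(query, quote=False)

print(query)    # title=Q%26A&printable=yes
print(escaped)  # title=Q%26A&amp;printable=yes

# Unescaping recovers the original query string, so no information is lost.
assert html.unescape(escaped) == query
```

If a URL does carry an unencoded ampersand inside a value, escaping it still yields valid HTML; the URL itself is what would be ambiguous.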