Extracting HTML tags using a stack implementation in Python
Read the file one character at a time, ignoring everything until a "<" is reached, then collect characters up to ">" or whitespace (the ">" itself is ignored too). The expected output should be: […html,body,h1,/h1,/h2,/body,…] In short, the goal is to get essentially all of the tags in the document.
Tags: python, html, python-3.x, html-parsing
<html>
<head>
<title>Title</title>
</head>
<body>
<p><strong><em>Q2. HTML TAG CHECKER</em></strong></p>
<p></p>
<p>A <em>markup language</em> is a language that annotates text so that the
computer can manipulate the text. Most markup languages are human readable
because the annotations are written in a way to distinguish them from the
text. The most important feature of a markup language is that the
<em>tags</em> it uses to indicate annotations should be easy to distinguish
from the document <em>content</em>.</p>
<p>One of the most well-known markup languages is the one commonly used to
create web pages, called <strong>HTML</strong>, or "Hypertext Markup
Language". In HTML, tags appear in "angle brackets" such as in
"<html>". When you load a Web page in your browser, you do not see
the tags themselves: the browser interprets the tags as instructions on how
to format the text for display.</p>
<p>Most tags in HTML are used in pairs to indicate where an effect starts
and ends. For example:</p>
<p><p>
this is a paragraph of text written in HTML
</p></p>
<p>Here <p> represents the start of a paragraph, and </p>
indicates where that paragraph ends.</p>
<p>Other tags include <b> and </b> that are used to place the
enclosed text in <strong>bold</strong> font, and <i> and </i>
indicate that the enclosed text is <em>italic</em>.</p>
<p>Note that "end" tags look just like the "start" tags, except for the
addition of a forward slash ‘/’ after the < symbol.</p>
<p>Sets of tags are often nested inside other sets of tags. For example, an
<em>ordered list</em> is a list of numbered bullets. You specify the start
of an ordered list with the tag <ol>, and the end with </ol>.
Within the ordered list, you identify items to be numbered with the tags
<li> (for "list item") and </li>. For example, the following
specification:</p>
<p><ol></p>
<p><li>First item</li></p>
<p><li>Second item</li></p>
<p><li>Third item</li></p>
<p></ol></p>
<p>would result in the following:</p>
<ol>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ol>
main.py
from Stack import Stack

# Processes HTML file and returns list of HTML tag objects
def process_html_file(file_name):
    tag_list = []
    s = Stack()
    with open(file_name, 'r') as f:
        all_lines = []
        # loop through all lines using f.readlines() method
        for line in f.readlines():
            new_line = []
            # this is how you would loop through each character
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)
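The loop above only copies characters into lists, so no tag is ever isolated. A minimal sketch of the character-by-character scan described in the question might look like the following. The name `extract_tags` is hypothetical, the function takes a string rather than a file name, and it assumes a tag name ends at the first ">" or whitespace character.

```python
def extract_tags(text):
    """Return the tag names found in text, without angle brackets."""
    tags = []
    current = None  # None while outside a tag; a list of characters while inside one
    for ch in text:
        if current is None:
            if ch == '<':
                current = []  # a tag starts here; begin collecting its name
        elif ch == '>' or ch.isspace():
            if current:  # skip empty names such as a stray "< " in the text
                tags.append(''.join(current))
            current = None  # ignore everything up to the next '<'
        else:
            current.append(ch)
    return tags


sample = "<html><body><h1>Title</h1><p class='x'>text</p></body></html>"
print(extract_tags(sample))
```

To run it on a file, the contents could be passed in with something like `extract_tags(open(file_name).read())`. Because collection stops at the first whitespace, attributes such as `class='x'` are dropped and only the tag name is kept.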
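Why does it skip the heading tags?
Sorry James, I just edited the question. I want to get all the tags from the document without the angle brackets.
Please edit again with what this code currently produces and the steps you have taken to debug the problem. Right now, all you are doing is appending the lines of text to a list. Also, the answer to your last question already does what you are asking for.
Do you need a stack? You could append the regex results to a stack...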
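The regex-plus-stack idea from the comments could be sketched as follows. This is an illustration under assumptions, not the asker's code: the `Stack` class from the question is not shown, so a plain list stands in for it, and `find_tags`, `tags_balanced`, and `TAG_PATTERN` are hypothetical names. The balance check reflects the "HTML TAG CHECKER" title of the sample document and assumes every start tag has an end tag, so void elements such as `<br>` would be reported as unbalanced.

```python
import re

# Matches "<", an optional "/", then a tag name; attributes are not captured.
TAG_PATTERN = re.compile(r'<(/?[A-Za-z][A-Za-z0-9]*)')


def find_tags(text):
    """Return tag names such as 'p' or '/p', in document order."""
    return TAG_PATTERN.findall(text)


def tags_balanced(tags):
    """Check that every end tag closes the most recently opened start tag."""
    stack = []  # a plain list used as a stack
    for tag in tags:
        if tag.startswith('/'):
            # An end tag must match the start tag on top of the stack.
            if not stack or stack.pop() != tag[1:]:
                return False
        else:
            stack.append(tag)
    return not stack  # leftover start tags mean something was never closed


tags = find_tags("<p><em>markup language</em></p>")
print(tags, tags_balanced(tags))
```

Keeping extraction and checking separate means the same `tags_balanced` function also works on the output of a character-by-character scanner.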