Extracting HTML tags with a stack implementation in Python


python html python-3.x


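Read the file one character at a time, ignoring everything until a "<" or whitespace is reached (the ">" is ignored as well).

The expected output should be: […html, body, h1, /h1, /h2, /body, …]

In short, I want to get essentially all of the tags from the document.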

<html>
<head>
    <title>Title</title>
</head>
<body>
    <p><strong><em>Q2. HTML TAG CHECKER</em></strong></p>
    <p></p>
    <p>A <em>markup language</em> is a language that annotates text so that the
    computer can manipulate the text. Most markup languages are human readable
    because the annotations are written in a way to distinguish them from the
    text. The most important feature of a markup language is that the
    <em>tags</em> it uses to indicate annotations should be easy to distinguish
    from the document <em>content</em>.</p>
    <p>One of the most well-known markup languages is the one commonly used to
    create web pages, called <strong>HTML</strong>, or "Hypertext Markup
    Language". In HTML, tags appear in "angle brackets" such as in
    "&lt;html&gt;". When you load a Web page in your browser, you do not see
    the tags themselves: the browser interprets the tags as instructions on how
    to format the text for display.</p>
    <p>Most tags in HTML are used in pairs to indicate where an effect starts
    and ends. For example:</p>
    <p>&lt;p&gt;
    this is a paragraph of text written in HTML
    &lt;/p&gt;</p>
    <p>Here &lt;p&gt; represents the start of a paragraph, and &lt;/p&gt;
    indicates where that paragraph ends.</p>
    <p>Other tags include &lt;b&gt; and &lt;/b&gt; that are used to place the
    enclosed text in <strong>bold</strong> font, and &lt;i&gt; and &lt;/i&gt;
    indicate that the enclosed text is <em>italic</em>.</p>
    <p>Note that "end" tags look just like the "start" tags, except for the
    addition of a backslash &lsquo;/&rsquo;after the &lt; symbol.</p>
    <p>Sets of tags are often nested inside other sets of tags. For example, an
    <em>ordered list</em> is a list of numbered bullets. You specify the start
    of an ordered list with the tag &lt;ol&gt;, and the end with &lt;/ol&gt;.
    Within the ordered list, you identify items to be numbered with the tags
    &lt;li&gt; (for "list item") and &lt;/li&gt;. For example, the following
    specification:</p>
    <p>&lt;ol&gt;</p>
    <p>&lt;li&gt;First item&lt;/li&gt;</p>
    <p>&lt;li&gt;Second item&lt;/li&gt;</p>
    <p>&lt;li&gt;Third item&lt;/li&gt;</p>
    <p>&lt;/ol&gt;</p>
    <p>would result in the following:</p>
    <ol>
        <li>First item</li>
        <li>Second item</li>
        <li>Third item</li>
    </ol>

main.py

from Stack import Stack

# Processes an HTML file; intended to return a list of HTML tags
def process_html_file(file_name):
    tag_list = []
    s = Stack()
    with open(file_name, 'r') as f:
        all_lines = []
        # loop through all lines using the f.readlines() method
        for line in f.readlines():
            new_line = []
            # loop through each character in the line
            for chars in line:
                new_line.append(chars)
            all_lines.append(new_line)

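Why does it skip the heading "Title" and the heading "Title"?

Sorry James, I just edited the question. I want to get all the tags from the document without the angle brackets.

Please edit the question again with what this code currently produces and the steps you have taken to debug the problem. Right now, all you are doing is adding lines of text to a list. Also, the answer to your last question already does what you asked for.

Do you need a stack? You could append the regex results to a stack...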
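The regex suggestion in the last comment could look roughly like the sketch below. The names TAG_RE and find_tags are made up for illustration, and a plain Python list stands in for the stack. A regular expression is not a full HTML parser (comments, scripts, and attribute values containing ">" will confuse it), so treat this as a rough approach for simple documents like the one in the question.

```python
import re

# Hypothetical pattern: captures the tag name, with a leading "/" for end tags,
# and skips any attributes up to the closing ">".
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)[^>]*>")


def find_tags(html):
    """Return tag names without angle brackets, in document order."""
    return TAG_RE.findall(html)


if __name__ == "__main__":
    sample = "<html><body><h1 class='a'>Title</h1></body></html>"
    stack = []  # a plain list used as a stack
    for tag in find_tags(sample):
        stack.append(tag)
    print(stack)
```

Escaped text such as &lt;p&gt; is not matched, because it contains no literal "<" character. For anything beyond simple input, the standard library's html.parser.HTMLParser is likely a safer choice.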

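One possible way to finish process_html_file along the lines the question describes is sketched below: scan one character at a time, start collecting at "<", and stop at whitespace or ">". The question's Stack module is not shown, so the small Stack class here is an assumed stand-in, and extract_tags, read_chars, and tags_balanced are hypothetical helper names. The sketch does not treat comments, doctype declarations, or self-closing tags specially.

```python
class Stack:
    """Minimal list-backed stack; an assumed stand-in for the question's Stack module."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def is_empty(self):
        return not self._items


def extract_tags(chars):
    """Collect tag names (without angle brackets) from an iterable of characters."""
    tags = []
    current = None       # None while outside a tag, else the characters collected so far
    collecting = False   # False once whitespace ends the tag name (attributes are skipped)
    for ch in chars:
        if ch == "<":
            current, collecting = [], True
        elif current is not None:
            if ch == ">":
                if current:
                    tags.append("".join(current))
                current = None
            elif ch.isspace():
                collecting = False
            elif collecting:
                current.append(ch)
    return tags


def read_chars(file_name):
    """Yield the file's contents one character at a time."""
    with open(file_name, "r") as f:
        while True:
            ch = f.read(1)
            if not ch:
                break
            yield ch


def process_html_file(file_name):
    """Return the list of tag names found in the file."""
    return extract_tags(read_chars(file_name))


def tags_balanced(tags):
    """Use the stack to check that every end tag closes the most recent start tag."""
    stack = Stack()
    for tag in tags:
        if tag.startswith("/"):
            if stack.is_empty() or stack.pop() != tag[1:]:
                return False
        else:
            stack.push(tag)
    return stack.is_empty()
```

Because extract_tags accepts any iterable of characters, it can be tried on a plain string before pointing it at a file, for example extract_tags("<p><em>hi</em></p>").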