Python: How do I get the HTML tags?

Suppose I have a text file like this:

<html><head>Headline<html><head>more words
</script>even more words</script>
<html><head>Headline<html><head>more words
</script>even more words</script>

How can I get the tags into a list like this:

<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>

I think this is what you are after:

import re

# input_file is assumed to be an already-open file object
html_string = input_file.read()
matches = re.findall(r'<.*?>', html_string)  # non-greedy: one tag per match
for m in matches:
    print(m)
Hope this helps.
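
Since the question asks for the tags as a list rather than printed output, here is a minimal sketch of the same regular-expression idea wrapped in a function. The name `extract_tags` and the `sample` string are illustrative assumptions, not part of the original answer. The pattern is the same non-greedy `<.*?>`, so it treats anything between angle brackets as a tag and is only suitable for simple input like the file in the question.

```python
import re

# Same non-greedy pattern as in the answer: the shortest run between < and >.
TAG_PATTERN = re.compile(r'<.*?>')


def extract_tags(text):
    """Return every <...> token found in text, in order of appearance.

    Hypothetical helper for illustration only.
    """
    return TAG_PATTERN.findall(text)


# Sample taken from the first two lines of the file shown in the question.
sample = (
    "<html><head>Headline<html><head>more words\n"
    "</script>even more words</script>\n"
)

tags = extract_tags(sample)
print(tags)
```

Because `findall` already returns a list, no extra loop is needed to collect the matches.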

Python has a module for this: HTMLParser.

Here is some code that does what you need:

from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class MyHTMLParser(HTMLParser):
    # Called once for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print("<%s>" % tag)

    # Called once for every closing tag, e.g. </script>
    def handle_endtag(self, tag):
        print("</%s>" % tag)

parser = MyHTMLParser()
parser.feed("""<html><head>Headline<html><head>more words
        </script>even more words</script>
        <html><head>Headline<html><head>more words
        </script>even more words</script>
        """)

A related discussion on SO should also help.

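Is this a continuation of another question? If so, you really should edit your other question instead of reposting.

I think you mean: re.findall('<.*?>', html_string)

@JackNull: You are absolutely right. The extra double quote was a typo and has been fixed.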