Python: How can I omit the <h> tags from this code?

python python-3.x

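This code fetches a web page and adds all of its header information to a list. How can I modify it so that, when printing, the program shows each item of the list on its own line and leaves out the header tags?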

from urllib.request import urlopen
address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen (address)

encoding = "utf-8"

list = []

for line in webPage:
    findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    line = str(line, encoding)
    for startHeader in findHeader:        
        endHeader = '</'+startHeader[1:]
        if (startHeader in line) and (endHeader in line):
            content = line.split(startHeader)[1].split(endHeader)[0]
            list.append(line)
            print (list)

webPage.close()

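If you don't mind using a third-party package, try BeautifulSoup to convert the HTML to plain text. Once you have the list, you can remove print(list) from the loop and then do the following: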
for e in list:
    # .rstrip() to remove trailing '\r\n'
    print(BeautifulSoup(e.rstrip(), "html.parser").text)

But don't forget to import BeautifulSoup first:
from bs4 import BeautifulSoup

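I'm assuming you installed bs4 before running this example (pip3 install beautifulsoup4).
Alternatively, you could strip the HTML tags with a regular expression, but that is likely to be more verbose and more error-prone than using an HTML parser such as bs4.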
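
For comparison, here is a minimal sketch of the regular-expression route mentioned above. The helper name strip_tags and the two sample lines are made up for illustration, and the pattern only removes things that look like tags; it is not a real HTML parser and will misbehave on comments, scripts, or a '>' inside an attribute value.

```python
import re
from html import unescape

# Anything between '<' and the next '>' is treated as a tag (a rough assumption).
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(fragment):
    """Remove tag-like text, decode entities such as &lt;, and trim whitespace."""
    return unescape(TAG_RE.sub('', fragment)).strip()

# Hypothetical lines of the kind the question's loop collects
sample_lines = [
    '<h1>HTML <span class="color_h1">Head</span></h1>\r\n',
    '<h2>The HTML &lt;head&gt; Element</h2>\r\n',
]

for line in sample_lines:
    print(strip_tags(line))
```

Entities are decoded only after the tags are removed, so an escaped &lt;head&gt; in the text survives as literal text instead of being stripped as a tag.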
Sorry, I don't understand exactly what you are trying to do.
But, for example, you can easily collect all of the unique headers in a dict:
from urllib.request import urlopen
import re

address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen(address)

# get page content
response = str(webPage.read(), encoding='utf-8')

# leave only <h*> tags content
p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL)
headers = re.findall(p, response)

# headers dict
my_headers = {}

for (tag, value) in headers:
    if tag not in my_headers.keys():
        my_headers[tag] = []

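    # remove any tags nested inside the header text
    # (re.sub returns a new string, so the result must be assigned back)
    value = re.sub('<[^>]*>', '', value)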

    # replace few special chars
    value = value.replace('&lt;', '<')
    value = value.replace('&gt;', '>')

    if value not in my_headers[tag]:
        my_headers[tag].append(value)

# output
print(my_headers)


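Output from the original run (the h1 entry still contains the embedded <span> tag, because in that run the result of re.sub was not assigned back to value):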
{'h2': ['The HTML <head> Element', 'Omitting <html> and <body>?', 'Omitting <head>', 'The HTML <title> Element', 'The HTML <style> Element', 'The HTML <link> Element', 'The HTML <meta> Element', 'The HTML <script> Element', 'The HTML <base> Element', 'HTML head Elements', 'Your Suggestion:', 'Thank You For Helping Us!'], 'h4': ['Top 10 Tutorials', 'Top 10 References', 'Top 10 Examples', 'Web Certificates'], 'h1': ['HTML <span class="color_h1">Head</span>'], 'h3': ['Example', 'W3SCHOOLS EXAMS', 'COLOR PICKER', 'SHARE THIS PAGE', 'LEARN MORE:', 'HTML/CSS', 'JavaScript', 'HTML Graphics', 'Server Side', 'Web Building', 'XML Tutorials', 'HTML', 'CSS', 'XML', 'Charsets']}

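You asked for results without the header tags. You already have those values in your content variable, but instead of adding content to the results list, you are adding line, which is the entire original line.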

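Next, you asked for each item to be printed on its own line. To do that, first remove the print statement inside the loop; it prints the whole list every time a result is added. Then add new code at the bottom of the program, outside all of the loops: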
for item in list:
    print(item)
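
Putting those two changes together, the question's loop might look like the sketch below. To keep it runnable without a network connection, the made-up sample_lines list stands in for the lines read from urlopen(address), and the list is renamed to results (my choice, not the asker's) so it no longer shadows the built-in list.

```python
# Stand-in for iterating over urlopen(address); these byte strings are invented.
sample_lines = [
    b'<h1>HTML Head</h1>\r\n',
    b'<p>Not a header</p>\r\n',
    b'<h2>Omitting head</h2>\r\n',
]

encoding = "utf-8"
find_header = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
results = []  # renamed from `list` to avoid shadowing the built-in

for raw in sample_lines:
    line = str(raw, encoding)
    for start_header in find_header:
        end_header = '</' + start_header[1:]
        if start_header in line and end_header in line:
            # keep only the text between the opening and closing tag
            content = line.split(start_header)[1].split(end_header)[0]
            results.append(content)

# print once, after all loops, one item per line
for item in results:
    print(item)
```

This still has the limitations of line-by-line matching: a header must open and close on the same line, and any tags nested inside it are kept in the text.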


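However, your technique for identifying headers in the HTML is not very robust. It requires the opening and closing tags of a header to be on the same line. It also requires that no line contains more than one header of any kind. And it requires every opening tag to have a matching closing tag. You cannot rely on any of these things, even in valid HTML.
The answer suggesting Beautiful Soup was on the right track, but don't use it only to strip the tags from the results; you can use it to find the results as well. Consider the following code: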
from bs4 import BeautifulSoup
from urllib.request import urlopen

address = "http://www.w3schools.com/html/html_head.asp"
webPage = urlopen(address)

# The list of tag names we want to find
# Just the names, not the angle brackets    
findHeader = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')

soup = BeautifulSoup(webPage, 'html.parser')
headers = soup.find_all(findHeader)
for header in headers:
    print(header.get_text())

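The find_all method accepts a list of tag names and returns Tag objects representing each match, in document order. We store that list in headers and then print the text of each header. The get_text method returns only the text portion of a tag, leaving out not just the surrounding header tags but also any embedded tags. (For example, the page you are scraping has some embedded span tags.)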
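One problem with what you have currently written is that the opening and closing header tags may be on different lines. Are we assuming the HTML is always valid?
As far as I'm concerned, it doesn't matter whether the HTML is valid.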
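
To illustrate the point about headers that span several lines, here is a small sketch that uses only the standard library's html.parser module rather than Beautiful Soup. This is a swapped-in technique, not what the answers above use. The class name HeadingCollector and the sample markup are invented for the example, and it assumes headings are not nested inside one another.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text of h1..h6 elements, ignoring any tags nested inside."""

    HEADINGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()      # character references are decoded by default
        self.headings = []      # finished heading texts, in document order
        self._current = None    # text pieces of the heading being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS and self._current is None:
            self._current = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._current is not None:
            # join the pieces and collapse runs of whitespace and newlines
            self.headings.append(' '.join(''.join(self._current).split()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

# Invented sample: the h1 spans three lines and contains a nested span
sample = (
    '<h1>HTML\n'
    '  <span class="color_h1">Head</span>\n'
    '</h1>\n'
    '<p>Not a header</p>\n'
    '<h2>The HTML &lt;head&gt; Element</h2>\n'
)

collector = HeadingCollector()
collector.feed(sample)
collector.close()

for heading in collector.headings:
    print(heading)
```

Because the parser works on the whole document rather than on individual lines, it does not matter where the line breaks fall inside a header.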