How do I extract text from an HTML page in Python?

For example, the page is the link below. I need to get each company's name, address and website. I tried the following to convert the HTML to text:
import nltk
from urllib import urlopen
url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx display=50"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)
But it returns an error:

ImportError: cannot import name 'urlopen'
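That ImportError comes from using the Python 2 import path on Python 3, where `urlopen` moved into `urllib.request`. A minimal sketch of the corrected fetch (the network call itself is left commented out):

```python
# Python 2: from urllib import urlopen  -> ImportError on Python 3.
# Python 3: urlopen lives in urllib.request.
from urllib.request import urlopen

url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
# html = urlopen(url).read()  # returns the raw page bytes (network call)
```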
Peter Wood has already answered your question, but if you want to extract the data (the company name, address and website), you need to fetch the HTML source and parse it with an HTML parser. I would suggest using requests to fetch the HTML source and BeautifulSoup to parse the result and extract the text you need. Here is a small snippet to get you started:
import requests
from bs4 import BeautifulSoup

link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
html = requests.get(link).text
# If you do not want to use requests, you can fetch the page with
# urllib.request.urlopen instead (the snippet above); it should not
# cause any issue.
soup = BeautifulSoup(html, "lxml")
res = soup.find_all("article", {"class": "listingItem"})
for r in res:
    print("Company Name: " + r.find("a").text)
    print("Address: " + r.find("div", {"class": "address"}).text)
    print("Website: " + r.find_all("div", {"class": "pageMeta-item"})[3].text)
You are using nltk.clean_html, and even once you get the import working you will be disappointed: current NLTK versions raise NotImplementedError there, and the error message itself directs you to an HTML parser such as BeautifulSoup instead.
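Since nltk.clean_html is gone, a minimal replacement for "HTML to plain text" is BeautifulSoup's get_text(); the HTML fragment below is a made-up example, not the real page markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the fetched page source.
html = "<article><a>ACME Architects</a><div class='address'>1 Main St</div></article>"

# get_text() strips all markup; separator/strip keep the pieces readable.
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
print(text)  # → ACME Architects 1 Main St
```

Note that get_text() flattens the whole document, so for structured fields (name, address, website) the find/find_all approach above is still the better fit.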