How to read HTML from a URL in Python 3

python html url


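I looked at earlier, similar questions and only came away more confused.

In Python 3.4, I want to read an HTML page as a string, given its URL.

In Perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). Python 3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to return an HTTPResponse object, but I can't print it, get its length, or index it. u1.body doesn't exist. I can't find a description of HTTPResponse for Python 3.

Is there an attribute of the HTTPResponse object that will give me the raw bytes of the HTML page?

(Irrelevant material from other questions includes urllib2, which doesn't exist in my Python, CSV parsers, and so on.)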

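Edit:

I found something in an earlier question that partly (mostly) does the job:

import urllib.request

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for lines in u2.readlines():
    print(lines)

I say "partly" because I don't want to read separate lines, just one big string.

I could concatenate the lines, but every printed line has a "b" character in front of it.

Where does that come from?

Again, I suppose I could delete the first character before concatenating, but that really is a kludge.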

urllib.request.urlopen(url).read()

should return the raw HTML page to you. In Python 3 the result is a bytes object rather than a str.

For Python 2:

import urllib
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.urlopen(some_url)
print filehandle.read()
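
For Python 3, the urlretrieve function mentioned in the question lives in urllib.request rather than at the top level of urllib. The following is a minimal sketch under stated assumptions: it uses a temporary local file and a file:// URL so that it runs without network access, and the file names source.html and copy.html are made up for illustration. The Python documentation lists urlretrieve as a legacy interface, so urlopen(...).read() is probably the safer choice for new code.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local page standing in for a real web page.
    source = Path(tmp) / "source.html"
    source.write_bytes(b"<html><body>hello</body></html>")

    target = Path(tmp) / "copy.html"
    # Python 3 spelling of the Python 2 call urllib.urlretrieve(url, filename).
    saved_path, headers = urllib.request.urlretrieve(source.as_uri(), str(target))

    # urlretrieve saves to disk; read the file back to get the page as a str.
    html = target.read_text(encoding="utf-8")

print(html)
```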

Note that Python 3 does not read the HTML as a string but as a bytes object, so you need to call decode to convert it to a string:

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)
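
The "b" prefix asked about in the question comes from this same distinction: readlines() returns bytes objects, and printing a bytes object shows its repr, such as b'...'. The prefix is not a character in the data, so there is nothing to strip. A minimal sketch, assuming a temporary local file served through a file:// URL so that it runs offline (page.html and its content are invented for illustration):

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical page written as UTF-8 bytes.
    page = Path(tmp) / "page.html"
    page.write_bytes("<html>\n<body>café</body>\n</html>\n".encode("utf-8"))

    with urllib.request.urlopen(page.as_uri()) as response:
        lines = response.readlines()  # a list of bytes objects

print(lines[0])  # prints b'<html>\n', the repr of a bytes object

# Join the bytes first, then decode once to get one big str.
raw = b"".join(lines)
html = raw.decode("utf-8")
print(html)
```

Calling response.read() instead of readlines() gives the same bytes in one piece, which avoids the join.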

Try the "requests" module, it is much simpler.

# Install first with: pip install requests

import requests

url = 'https://www.google.com/'
r = requests.get(url)
print(r.text)  # r.text holds the response body decoded to a str

This works much like urllib.urlopen.

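Reading an HTML page with urllib is fairly simple. Since you want to read it as a single string, I will show you how.

Import urllib.request:

#!/usr/bin/python3.5

import urllib.request

Prepare our request:

request = urllib.request.Request('http://www.w3schools.com')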

Always use try/except when requesting a web page, because things can easily go wrong. urlopen() requests the page.

try:
    response = urllib.request.urlopen(request)
except urllib.error.URLError as err:
    # URLError also covers HTTPError; urllib.error is loaded by "import urllib.request".
    # Stop here, because "response" is not defined if the request failed.
    raise SystemExit("something wrong: {}".format(err))

type() is a handy function that tells us the type of a variable. Here, response is an http.client.HTTPResponse object.

print(type(response))

The read() method of the response object returns the HTML as bytes, which we store in a variable. Again, type() will confirm this.

htmlBytes = response.read()

print(type(htmlBytes))

Now we use the decode() method of the bytes variable to get a single string.

htmlStr = htmlBytes.decode("utf8")

print(type(htmlStr))

If you want to split this string into separate lines, you can use the split() function. In this form we can easily iterate over the lines to print the whole page or do any other processing.

htmlSplit = htmlStr.split('\n')

print(type(htmlSplit))

for line in htmlSplit:
    print(line)

Hopefully this provides a more detailed answer. The Python documentation and tutorials are excellent, and I would use them as a reference because they will answer most questions you might have.
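
The walkthrough above hard-codes "utf8" in the decode step. A hedged alternative is to ask the response headers for the declared charset and fall back to UTF-8 only when none is given. In the sketch below, decode_body is a hypothetical helper name, and an email.message.Message object stands in for response.headers (which is a subclass of that class) so that the example runs without a network connection.

```python
from email.message import Message


def decode_body(raw, headers, fallback="utf-8"):
    """Decode raw bytes using the charset from the Content-Type header, if any."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or fallback
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")


# Stand-in for response.headers; the header value is an assumption for illustration.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

raw = "café".encode("latin-1")
text = decode_body(raw, headers)
print(text)
```

With a real response this would be called as decode_body(response.read(), response.headers).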

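@user1067305 request.urlopen(), and the read() method...

OK! I tried it like this: u2 = urllib.request.urlopen('...') junk = u2.read() print(junk)

That is what the Python 3 documentation says.

The fp object has a readlines() method, at least in Python 3.6.1.

Assuming the page is UTF-8 encoded is not a good idea. You should try to read the header.

I could not write mystr to a text file. Every time I run the program I get this error:

return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 369774-369777: character maps to <undefined>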
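
One likely cause of this error, assuming the file was opened without an explicit encoding, is that open() then uses the platform's default encoding (often a "charmap" codec such as cp1252 on Windows), which cannot represent every character in the page. A minimal sketch of passing the encoding explicitly follows; the sample string and the file name are assumptions for illustration.

```python
import os
import tempfile

# Stand-in for the decoded page; the snowman character is not in cp1252.
mystr = "<p>snowman: \u2603</p>"

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "page.html")

    # An explicit encoding avoids depending on the platform default.
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(mystr)

    # Read the file back with the same encoding to check the round trip.
    with open(out_path, encoding="utf-8") as saved:
        round_trip = saved.read()

print(round_trip == mystr)
```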

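@CpILL Good catch, I agree. Although UTF-8 is widely used, you may run into problems.

import requests is Python 2, isn't it?

What do you mean? In Python 3 you use import libname. Where is it specified that it is for Python 2?

As far as I checked, urllib.urlopen does not work in Python 3.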