Python TypeError on regular expressions

python regex python-3.x


So, I have the following code:

url = 'http://google.com'
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(msg)

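What am I doing wrong?

Well, my version of Python doesn't have a urllib with a request attribute, but if I use "urllib.urlopen(url)" I don't get a string back, I get an object. This is the type error.

If you are running Python 2.6, there is no "request" in "urllib". So the third line becomes: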

m = urllib.urlopen(url)

In version 3, you should use:

links = linkregex.findall(str(msg))

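because "msg" is a bytes object, not the string that findall() expects. Note that str(msg) returns the printable representation of the bytes (including the b'...' wrapper), so it is a workaround rather than a real conversion. Alternatively, you can decode using the correct encoding. For example, if "latin1" is the encoding, then: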

links = linkregex.findall(msg.decode("latin1"))
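
To make the difference between the two suggestions concrete, here is a small offline sketch. The "msg" bytes literal is a made-up stand-in for what m.read() might return, and UTF-8 is an assumed encoding; neither comes from the original post. The regex is the same one in spirit, written as a raw string.

```python
import re

# Hypothetical stand-in for the bytes returned by m.read(); not real page data.
msg = b'<a href="http://example.com/caf\xc3\xa9">link</a>'

linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

# str() on bytes returns the repr, so non-ASCII bytes stay as escape sequences.
from_repr = linkregex.findall(str(msg))

# decode() turns the bytes into real text (UTF-8 is assumed here).
from_decode = linkregex.findall(msg.decode("utf-8"))

print(from_repr)
print(from_decode)
```

Both calls find the link, but only the decoded version yields the actual characters of the URL, which is why decoding with the page's real encoding is usually the safer choice.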

The URL you gave for Google didn't work for me, so I substituted

http://www.google.com/ig?hl=en

which works for me.

Try this:

import re
import urllib.request

url="http://www.google.com/ig?hl=en"
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')
m = urllib.request.urlopen(url)
msg = m.read()
links = linkregex.findall(str(msg))
print(links)

The error raised by the code in the question:

TypeError: can't use a string pattern on a bytes-like object

You are using a string pattern on a bytes object. Use a bytes pattern instead:
linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
                       ^
            Add the b there, it makes it into a bytes object

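The regular expression pattern and the string must be of the same type. If you are matching a regular string, you need a string pattern. If you are matching a byte string, you need a bytes pattern.
In this case, m.read() returns a byte string, so you need a bytes pattern. In Python 3, regular strings are unicode strings, and you need the b prefix to write a byte string literal: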
linkregex = re.compile(b'<a\s*href=[\'|"](.*?)[\'"].*?>')
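
The rule above can be checked without any network access. In this sketch, "page" is an invented bytes value standing in for m.read(); it shows the type mismatch raising TypeError and the bytes pattern returning bytes matches.

```python
import re

# Invented sample standing in for the bytes that m.read() returns.
page = b'<a href="http://example.com/">home</a>'

str_pattern = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')
bytes_pattern = re.compile(rb'<a\s*href=[\'"](.*?)[\'"].*?>')

# A str pattern applied to bytes raises TypeError.
try:
    str_pattern.findall(page)
    mismatch_raised = False
except TypeError:
    mismatch_raised = True

# A bytes pattern works on bytes, and the captured groups are bytes too.
links = bytes_pattern.findall(page)

print(mismatch_raised)
print(links)
```

If text is needed downstream, each match can be decoded afterwards, for example with links[0].decode("utf-8"), assuming the page really is UTF-8.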

This worked for me in Python 3. Hope this helps:
import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, str(htmltext))
    print(titles)
    i+=1


I added b before the regex to turn it into a bytes pattern:
import urllib.request
import re
urls = ["https://google.com","https://nytimes.com","http://CNN.com"]
i = 0
regex = b'<title>(.+?)</title>'
pattern = re.compile(regex)

while i < len(urls) :
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    titles = re.search(pattern, htmltext)
    print(titles)
    i+=1
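
For comparison, the same idea can be written with a small helper instead of an index-driven while loop. This is only a sketch: extract_title and fetch_title are hypothetical names, the UTF-8 fallback is an assumption, and fetch_title is defined but not called here because it needs network access.

```python
import re
import urllib.request

# Bytes pattern, since urlopen(...).read() returns bytes in Python 3.
TITLE_PATTERN = re.compile(rb'<title>(.+?)</title>')

def extract_title(html_bytes, encoding="utf-8"):
    """Return the first <title> text as str, or None if there is none."""
    match = TITLE_PATTERN.search(html_bytes)
    if match is None:
        return None
    # errors="replace" avoids an exception if the assumed encoding is wrong.
    return match.group(1).decode(encoding, errors="replace")

def fetch_title(url):
    """Download a page and return its title (requires network access)."""
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header when one is given.
        encoding = response.headers.get_content_charset() or "utf-8"
        return extract_title(response.read(), encoding)

# Offline check with an invented page body.
sample = b"<html><head><title>Example Page</title></head></html>"
print(extract_title(sample))
```

With network access, the original list could then be processed as "for url in urls: print(fetch_title(url))".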

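
Which version of Python are you running?
Here is a documentation link that supports this.
These are the 2.7 docs. The OP said in a comment that he is using 3.1.3.
John, read those docs. The API is still the same.
My point is that your version has no request attribute, but the OP's does. You are right about the cause of the type error.
Yes, the version was mentioned after I gave my answer.
He said in a comment that he is running 3.1.3, so there is a request. In fact, he saw this later.
So I added a solution for version 3 as well.
This only works if the system Python default encoding is the same as the web page's encoding.
Will it break with python2?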