Python crawler doesn't run because of an error in htmlfile = urllib.request.urlopen(urls[i])

python python-3.x

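I'm trying to write a web crawler: the user lists sites in a websites.txt file, and the Python code reads that file, fetches the URLs one by one, and extracts each page's title.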

import urllib.request
import re

i=0

regex = "<title>(.+?)</title>"
pattern = re.compile(regex)

txtfl = open('websites.txt')
webpgsinfile = txtfl.readlines()
urls = webpgsinfile

while i< len(urls):
    htmlfile = urllib.request.urlopen(urls[i])
    htmltext = htmlfile.read()
    print(htmltext)
    titles = re.findall(pattern,htmltext)
    print(titles)
    i+=1

But I get an error:
Traceback (most recent call last):
  File "C:\Users\Vinicius\Documents\GitHub\python-crawler\scrapper-2-0.py", line 17, in <module>
    titles = re.findall(pattern,htmltext)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

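urlopen(...).read() returns bytes, while your pattern is a str. Either decode the downloaded HTML to Unicode text, or use a b'...' bytes regular expression: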
regex = b"<title>(.+?)</title>"
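
Both fixes can be compared side by side in a minimal sketch. The sample bytes below are a hypothetical stand-in for what htmlfile.read() returns, and UTF-8 is an assumption; a real page may declare a different charset in its headers.

```python
import re

# Hypothetical sample standing in for the bytes returned by htmlfile.read().
html_bytes = b"<html><head><title>Example Domain</title></head><body></body></html>"

# Option 1: decode the bytes to str, then keep the original str pattern.
# UTF-8 is assumed here; errors="replace" avoids a crash on bad bytes.
str_pattern = re.compile(r"<title>(.+?)</title>")
html_text = html_bytes.decode("utf-8", errors="replace")
titles_from_text = str_pattern.findall(html_text)

# Option 2: keep the bytes and compile a bytes pattern instead.
bytes_pattern = re.compile(rb"<title>(.+?)</title>")
titles_from_bytes = bytes_pattern.findall(html_bytes)

print(titles_from_text)   # ['Example Domain']
print(titles_from_bytes)  # [b'Example Domain']
```

Note that the second option yields bytes results, so each title still needs decoding before it can be printed as text.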

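However, you are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example: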
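import urllib.request
from bs4 import BeautifulSoup

# url is a single entry from websites.txt, with surrounding whitespace stripped
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response.read(), 'html.parser', from_encoding=response.info().get_param('charset'))
title = soup.find('title').text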

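Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags you will run into hugely complex issues.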
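
If a third-party dependency is not wanted, the standard library's html.parser module is one of the parser options available. The following is only a sketch: TitleParser and extract_title are hypothetical names introduced for illustration, and the input is assumed to be already decoded to str.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> matches too.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title


# A title spread over several lines and written in upper case, which the
# pattern "<title>(.+?)</title>" from the question would not match.
sample = "<html><head><TITLE>\n  Example Domain\n</TITLE></head><body><p>hi</p></body></html>"
print(extract_title(sample))  # Example Domain
```

The design choice here is to react to parser events rather than match raw text, so differences in case and line breaks inside the tag no longer matter.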
Martijn, I'm using Python 3.3 and I don't want another library in my code!