Python "cannot use a string pattern on a bytes-like object" when trying to parse HTML for emails
So I have a script I've been working on for a few days, trying to pull a list of emails from a csv, but now I've hit this roadblock. Here is the code:
import sys
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
import re
import csv

list1 = []
list2 = []
list3 = []

def addList():
    with open('file.csv', 'rt') as f:
        reader = csv.reader(f)
        for row in reader:
            for s in row:
                list2.append(s)

def getAddress(url):
    http = "http://"
    https = "https://"
    if http in url:
        return url
    elif https in url:
        return url
    else:
        url = "http://" + url
        return url

def parseAddress(url):
    global list3
    try:
        website = urllib2.urlopen(getAddress(url))
        html = website.read()

        addys = re.findall('''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''', html, flags=re.IGNORECASE)

        global list1
        list1.append(addys)
    except urllib2.HTTPError as err:
        print ("Cannot retrieve URL: HTTP Error Code: "), err.code
        list3.append(url)
    except urllib2.URLError as err:
        print ("Cannot retrive URL: ") + err.reason[1]
        list3.append(url)

def execute():
    global list2
    addList()
    totalNum = len(list2)
    atNum = 1
    for s in list2:
        parseAddress(s)
        print ("Processing ") + str(atNum) + (" out of ") + str(totalNum)
        atNum = atNum + 1
    print ("Completed. Emails parsed: ") + str(len(list1)) + "."

### MAIN
def main():
    global list2
    execute()
    global list1
    myFile = open("finishedFile.csv", "w+")
    wr = csv.writer(myFile, quoting=csv.QUOTE_ALL)
    for s in list1:
        wr.writerow(s)
    myFile.close
    global list3
    failFile = open("failedSites.csv", "w+")
    write = csv.writer(failFile, quoting=csv.QUOTE_ALL)
    for j in list3:
        write.writerow(j)
    failFile.close

main()
When I run it, I get the following error:
Traceback (most recent call last):
File "pagescanner.py", line 85, in <module>
main()
File "pagescanner.py", line 71, in main
execute()
File "pagescanner.py", line 60, in execute
parseAddress(s)
File "pagescanner.py", line 42, in parseAddress
addys = re.findall('''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''', html, flags=re.IGNORECASE)
File "/usr/lib/python3.5/re.py", line 213, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
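The root of that first error is a type mismatch: in Python 3, `website.read()` returns `bytes`, while the regex pattern passed to `re.findall` is a `str`. A minimal sketch of the mismatch and two possible fixes (the sample text and the simplified pattern here are illustrative, not the pattern from the script):

```python
import re

html_bytes = b"contact us at info@example.com"  # what website.read() returns

# A str pattern applied to bytes raises TypeError in Python 3:
try:
    re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html_bytes)
except TypeError as exc:
    print(exc)  # cannot use a string pattern on a bytes-like object

# Fix 1: decode the bytes into a str before matching
addys = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html_bytes.decode("utf-8"))
print(addys)  # ['info@example.com']

# Fix 2: use a bytes pattern on the bytes directly
addys_b = re.findall(rb"[\w.+-]+@[\w-]+\.[\w.-]+", html_bytes)
print(addys_b)  # [b'info@example.com']
```

Either side can be converted, but the pattern and the subject must be the same type.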
So I gathered that I need to figure out how to decode the HTML from bytes into a string, and Taylor's answer below helped me do that, but now I'm getting this error:
Traceback (most recent call last):
File "/usr/lib/python3.5/urllib/request.py", line 1254, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.5/http/client.py", line 1107, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.5/http/client.py", line 1152, in _send_request
self.endheaders(body)
File "/usr/lib/python3.5/http/client.py", line 1103, in endheaders
self._send_output(message_body)
File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
self.send(msg)
File "/usr/lib/python3.5/http/client.py", line 877, in send
self.connect()
File "/usr/lib/python3.5/http/client.py", line 849, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.5/socket.py", line 712, in create_connection
raise err
File "/usr/lib/python3.5/socket.py", line 703, in create_connection
sock.connect(sa)
OSError: [Errno 22] Invalid argument
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pagescanner.py", line 39, in parseAddress
website = urllib2.urlopen(getAddress(url))
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 466, in open
response = self._open(req, data)
File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
'_open', req)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 1282, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.5/urllib/request.py", line 1256, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 22] Invalid argument>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pagescanner.py", line 85, in <module>
main()
File "pagescanner.py", line 71, in main
execute()
File "pagescanner.py", line 60, in execute
parseAddress(s)
File "pagescanner.py", line 51, in parseAddress
print ("Cannot retrive URL: ") + err.reason[1]
TypeError: 'OSError' object is not subscriptable
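That last TypeError is a separate bug in the exception handler: `print("...") + err.reason[1]` is a Python 2 idiom. In Python 3, `print()` returns `None`, and `URLError.reason` is often an `OSError` rather than a tuple, so indexing it fails. A hedged sketch of the handler rewritten for Python 3 (`fetch` is an illustrative helper, not part of the original script):

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the decoded page body, or None on failure (illustrative helper)."""
    try:
        with urllib.request.urlopen(url) as website:
            return website.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Pass values to print() as separate arguments instead of concatenating
        print("Cannot retrieve URL: HTTP Error Code:", err.code)
    except urllib.error.URLError as err:
        # err.reason may be a string or an OSError; print it whole, don't index it
        print("Cannot retrieve URL:", err.reason)
    return None
```

Printing `err.reason` whole sidesteps the "not subscriptable" crash regardless of which type it turns out to be.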
Does this mean one of the URLs in my list isn't a valid URL? I thought I had finally removed all the bad URLs from my csv file, but I may need to take another look.

To answer your question, you just need to decode the response properly. Instead of

html = website.read()

try

html = website.read().decode('utf-8')
I'd also recommend a few things that might make your life easier:

urllib.parse makes handling URLs much less of a headache, and tends to make things more readable when you inevitably hit an error somewhere.

The requests library is also the gold standard for handling HTTP requests, and may help resolve the encoding confusion and other overhead of the standard urllib.request.

And beautifulsoup is an excellent tool for handling HTML.
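As one example of the urllib.parse suggestion: the substring checks in getAddress above can match "http://" anywhere in the string, whereas parsing the URL checks the actual scheme. A sketch under that assumption (normalize_url is an illustrative name, not a drop-in replacement):

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Prepend http:// only when the URL has no recognized scheme (illustrative)."""
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https"):
        return url
    return "http://" + url

print(normalize_url("example.com"))          # http://example.com
print(normalize_url("https://example.com"))  # https://example.com
```

Checking `parsed.scheme` rather than doing `"http://" in url` means a URL that merely *contains* "http://" somewhere later in the string still gets a scheme prepended.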
Thanks! That did the trick for that problem, but now I'm getting another error that's so long I had to post another question. This code will be the death of me :/