Python "cannot use a string pattern on a bytes-like object" when trying to parse HTML for emails
So I have a script I've been working on for a few days, trying to pull a list of emails from a csv, but now I've hit this roadblock. Here is the code:
import sys
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2
import re
import csv

list1 = []
list2 = []
list3 = []

def addList():
    with open('file.csv', 'rt') as f:
        reader = csv.reader(f)
        for row in reader:
            for s in row:
                list2.append(s)

def getAddress(url):
    http = "http://"
    https = "https://"
    if http in url:
        return url
    elif https in url:
        return url
    else:
        url = "http://" + url
        return url

def parseAddress(url):
    global list3
    try:
        website = urllib2.urlopen(getAddress(url))
        html = website.read()

        addys = re.findall('''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''', html, flags=re.IGNORECASE)

        global list1
        list1.append(addys)
    except urllib2.HTTPError as err:
        print ("Cannot retrieve URL: HTTP Error Code: "), err.code
        list3.append(url)
    except urllib2.URLError as err:
        print ("Cannot retrive URL: ") + err.reason[1]
        list3.append(url)

def execute():
    global list2
    addList()
    totalNum = len(list2)
    atNum = 1
    for s in list2:
        parseAddress(s)
        print ("Processing ") + str(atNum) + (" out of ") + str(totalNum)
        atNum = atNum + 1
    print ("Completed. Emails parsed: ") + str(len(list1)) + "."

### MAIN
def main():
    global list2
    execute()
    global list1
    myFile = open("finishedFile.csv", "w+")
    wr = csv.writer(myFile, quoting=csv.QUOTE_ALL)
    for s in list1:
        wr.writerow(s)
    myFile.close
    global list3
    failFile = open("failedSites.csv", "w+")
    write = csv.writer(failFile, quoting=csv.QUOTE_ALL)
    for j in list3:
        write.writerow(j)
    failFile.close

main()
When I run it, I get the following error:
Traceback (most recent call last):
File "pagescanner.py", line 85, in <module>
main()
File "pagescanner.py", line 71, in main
execute()
File "pagescanner.py", line 60, in execute
parseAddress(s)
File "pagescanner.py", line 42, in parseAddress
addys = re.findall('''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''', html, flags=re.IGNORECASE)
File "/usr/lib/python3.5/re.py", line 213, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
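The root of that first error is a type mismatch: in Python 3, `website.read()` returns `bytes`, while the regex pattern passed to `re.findall` is a `str`. A minimal sketch of the mismatch and two possible fixes (the sample text and the simplified pattern here are illustrative, not the pattern from the script):

```python
import re

html_bytes = b"contact us at info@example.com"  # what website.read() returns

# A str pattern applied to bytes raises TypeError in Python 3:
try:
    re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html_bytes)
except TypeError as exc:
    print(exc)  # cannot use a string pattern on a bytes-like object

# Fix 1: decode the bytes into a str before matching
addys = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html_bytes.decode("utf-8"))
print(addys)  # ['info@example.com']

# Fix 2: use a bytes pattern on the bytes directly
addys_b = re.findall(rb"[\w.+-]+@[\w-]+\.[\w.-]+", html_bytes)
print(addys_b)  # [b'info@example.com']
```

Either side can be converted, but the pattern and the subject must be the same type.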
So I gathered that I need to figure out how to decode the HTML from bytes into a string, and Taylor's answer below helped me do that, but now I'm getting this error:
Traceback (most recent call last):
File "/usr/lib/python3.5/urllib/request.py", line 1254, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/lib/python3.5/http/client.py", line 1107, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python3.5/http/client.py", line 1152, in _send_request
self.endheaders(body)
File "/usr/lib/python3.5/http/client.py", line 1103, in endheaders
self._send_output(message_body)
File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
self.send(msg)
File "/usr/lib/python3.5/http/client.py", line 877, in send
self.connect()
File "/usr/lib/python3.5/http/client.py", line 849, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.5/socket.py", line 712, in create_connection
raise err
File "/usr/lib/python3.5/socket.py", line 703, in create_connection
sock.connect(sa)
OSError: [Errno 22] Invalid argument
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pagescanner.py", line 39, in parseAddress
website = urllib2.urlopen(getAddress(url))
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 466, in open
response = self._open(req, data)
File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
'_open', req)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 1282, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/lib/python3.5/urllib/request.py", line 1256, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 22] Invalid argument>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pagescanner.py", line 85, in <module>
main()
File "pagescanner.py", line 71, in main
execute()
File "pagescanner.py", line 60, in execute
parseAddress(s)
File "pagescanner.py", line 51, in parseAddress
print ("Cannot retrive URL: ") + err.reason[1]
TypeError: 'OSError' object is not subscriptable
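That last TypeError is a separate bug in the exception handler: `print("...") + err.reason[1]` is a Python 2 idiom. In Python 3, `print()` returns `None`, and `URLError.reason` is often an `OSError` rather than a tuple, so indexing it fails. A hedged sketch of the handler rewritten for Python 3 (`fetch` is an illustrative helper, not part of the original script):

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the decoded page body, or None on failure (illustrative helper)."""
    try:
        with urllib.request.urlopen(url) as website:
            return website.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # Pass values to print() as separate arguments instead of concatenating
        print("Cannot retrieve URL: HTTP Error Code:", err.code)
    except urllib.error.URLError as err:
        # err.reason may be a string or an OSError; print it whole, don't index it
        print("Cannot retrieve URL:", err.reason)
    return None
```

Printing `err.reason` whole sidesteps the "not subscriptable" crash regardless of which type it turns out to be.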
Does this mean one of the URLs in my list isn't a valid URL? I thought I had finally removed all the bad URLs from my csv file, but I may need to take another look.

To answer your question, you just need to decode the response properly. Instead of

html = website.read()

try

html = website.read().decode('utf-8')
I'd also recommend a few things that might make your life easier:

urllib.parse makes handling URLs much less of a headache, and tends to make things more readable when you inevitably hit an error somewhere.

The requests library is also the gold standard for handling HTTP requests, and may help resolve the encoding confusion and other overhead of the standard urllib.request.

And beautifulsoup is an excellent tool for handling HTML.
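As one example of the urllib.parse suggestion: the substring checks in getAddress above can match "http://" anywhere in the string, whereas parsing the URL checks the actual scheme. A sketch under that assumption (normalize_url is an illustrative name, not a drop-in replacement):

```python
from urllib.parse import urlparse

def normalize_url(url):
    """Prepend http:// only when the URL has no recognized scheme (illustrative)."""
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https"):
        return url
    return "http://" + url

print(normalize_url("example.com"))          # http://example.com
print(normalize_url("https://example.com"))  # https://example.com
```

Checking `parsed.scheme` rather than doing `"http://" in url` means a URL that merely *contains* "http://" somewhere later in the string still gets a scheme prepended.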
Thanks! That did the trick for that problem, but now I'm getting another error that's so long I had to post another question. This code will be the death of me :/