How to catch a UnicodeDecodeError caused by an invalid continuation byte in MySQL data

Tags: mysql, python-3.x, utf-8, mysql-python, unicode-string

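I am moving tens of millions of rows of text data from MySQL to a search engine, and I can't successfully handle a Unicode error in one of the retrieved strings. I have tried explicitly encoding and decoding the retrieved strings to make Python throw the Unicode exception and show me where the problem lies.

The exception is thrown after my laptop has worked through tens of millions of rows (sigh...), but I am unable to catch it, skip that row, and continue, which is what I want. All text in the MySQL database is supposed to be UTF-8.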

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 143: invalid continuation byte
Here is what I am using:

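+--------------------------+-----------------+
| Variable_name            | Value           |
+--------------------------+-----------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_filesystem | binary          |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| collation_connection     | utf8_general_ci |
| collation_database       | utf8_general_ci |
| collation_server         | utf8_general_ci |
+--------------------------+-----------------+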

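What is wrong with the exception handling below? Note that the variable "last_feeds_id" is not printed either, but that is probably just more evidence that the except clauses are not being reached.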

last_feeds_id = 0
for feedsid, ts, url, bid, title, html in cursor:

  try:
    # to catch UnicodeErrors and see where the problem lies
    # from: https://mail.python.org/pipermail/python-list/2012-July/627441.html
    # also see https://stackoverflow.com/questions/28583565/str-object-has-no-attribute-decode-python-3-error

    # feeds.URL is varchar(255) in mysql
    enc_url = url.encode(encoding = 'UTF-8',errors = 'strict')
    dec_url = enc_url.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.title is varchar(600) in mysql
    enc_title = title.encode(encoding = 'UTF-8',errors = 'strict')
    dec_title = enc_title.decode(encoding = 'UTF-8',errors = 'strict')

    # texts.html is text in mysql
    enc_html = html.encode(encoding = 'UTF-8',errors = 'strict')
    dec_html = enc_html.decode(encoding = 'UTF-8',errors = 'strict')

    data = {"timestamp":ts,
            "url":dec_url,
           "bid":bid,
           "title":dec_title,
           "html":dec_html}
    es.index(index="blogposts",
            doc_type="blogpost",
            body=data)
  except UnicodeDecodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeEncodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

  except UnicodeError as e:
    print("Last feeds id: {}".format(last_feeds_id))
    print(e)

It is complaining about hex ED. Are you expecting an acute-i: í? If so, then your text is not encoded in UTF-8, but in one of cp1250, dec8, latin1, latin2, latin5.
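To make the diagnosis above concrete, here is a minimal sketch that reproduces the same kind of error. The sample bytes are a made-up example (the word "vídeo" as latin1), not data from the question; a lone 0xED followed by an ASCII letter is what UTF-8 reports as an invalid continuation byte.

```python
# Hypothetical sample: "vídeo" stored as latin1 bytes (0xED is "í" in latin1).
raw = b"v\xeddeo"

try:
    raw.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    # exc.object is the input bytes, exc.start the offset of the bad byte
    reason = exc.reason
    bad_byte = exc.object[exc.start]
    print("byte 0x{:02x} at position {}: {}".format(bad_byte, exc.start, reason))

# The same bytes decode cleanly when treated as latin1.
as_latin1 = raw.decode("latin-1")
print(as_latin1)
```

If the bytes around position 143 of the failing row look like this, the column most likely holds latin1-style data rather than UTF-8.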

Does your Python source code start with

# -*- coding: utf-8 -*-

Also, review the "Best Practice".


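You have charset='utf-8'; I am not sure, but maybe it should be charset='utf8'. UTF-8 is what the rest of the world calls the character set. MySQL calls its 3-byte subset utf8. Note the absence of a dash.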
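As a small illustration of the "3-byte subset" remark: MySQL's utf8 (also known as utf8mb3) only covers characters that need at most three bytes in UTF-8, i.e. the Basic Multilingual Plane. The helper name `fits_mysql_utf8` below is made up for this sketch and is not part of any library.

```python
def fits_mysql_utf8(text):
    """Hypothetical helper: True if every character is in the Basic
    Multilingual Plane, so it needs at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)

# Show the UTF-8 byte length of a few sample characters.
for ch in ("i", "í", "€", "😀"):
    print(repr(ch), len(ch.encode("utf-8")), fits_mysql_utf8(ch))
```

Swedish text with accented letters fits comfortably; only characters outside the BMP, such as most emoji, would need MySQL's utf8mb4 instead.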

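The MySQL database has been around for many years and I did not create it, so it may not be UTF-8 after all. But why doesn't my code catch the exception and carry on? I use a Jupyter notebook, so no # -*- coding: utf-8 -*-. I can't say whether I would expect an acute-i character in the text; it is mostly Swedish but may quote other languages.

@mattiasostmar - All accented Western European characters would have the same problem. Can you get a hex dump of that part of the input? Also see the last paragraph I added.

You were right. I changed to 'latin1' in mysql.connector.connect(…charset='latin1') and in the .encoding('latin1'…).decoding('latin1'…) lines, and it never raised an exception. Done!

@mattiasostmar - Good. What is the .encoding("…") call? It is not in my notes. (Maybe I am about to learn something here?)

Actually, I only added the explicit .encoding()/.decoding() lines and the three except clauses to trigger the exception and learn about the problem. I picked that up from there, but I may have misunderstood the process, since I don't know much about Unicode yet.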
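On the open question of why the except clauses never fire: the traceback is not shown, so this is a guess, but in Python 3 the values `url`, `title` and `html` are already `str` by the time the loop body runs. The decoding therefore happened inside the driver while the cursor was being iterated, that is, in the `for` statement itself and outside the `try`. Re-encoding a `str` to UTF-8 and decoding it again cannot raise a UnicodeDecodeError. A minimal sketch of the alternative, assuming the driver can hand back undecoded bytes (mysql.connector documents a `raw` option for this) and using made-up sample rows in place of a real cursor:

```python
def split_rows(raw_rows, encoding="utf-8"):
    """Decode each (row_id, raw_bytes) pair inside the try block.

    Returns the decoded rows and the ids of rows that failed to decode.
    """
    good, bad_ids = [], []
    for row_id, raw_text in raw_rows:
        try:
            # The decode now happens where the except clause can see it.
            good.append((row_id, bytes(raw_text).decode(encoding)))
        except UnicodeDecodeError as exc:
            print("Skipping feeds id {}: {}".format(row_id, exc))
            bad_ids.append(row_id)
    return good, bad_ids

# Hypothetical stand-in for rows fetched as raw bytes; row 2 is latin1.
rows = [(1, b"hej"), (2, b"v\xeddeo"), (3, "smörgås".encode("utf-8"))]
good, bad_ids = split_rows(rows)
print(good, bad_ids)
```

The collected `bad_ids` can then be re-read with a different charset or inspected as a hex dump, instead of aborting a run that has already covered millions of rows.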