Python 修复错误的unicode字符串_Python_Unicode

Python 修复错误的unicode字符串

python unicode

Python 修复错误的unicode字符串,python,unicode,Python,Unicode,错误的unicode字符串是指其中意外编码了字节的字符串。例如：文本：ש1500;ום，Windows-1255编码：\x99\x8c\x85\x8d，Unicode:u'\u05e9\u05dc\u05d5\u05dd'，错误的Unicode:u'\x99\x8c\x85\x8d' 在解析MP3文件中的ID3标记时，我有时会碰到这样的字符串。如何修复这些字符串？（例如，将u'\x99\x8c\x85\x8d'转换为u'\u05e9\u05dc\u05d5\u05dd'）您可以使用u'\x9

错误的unicode字符串是指其中意外编码了字节的字符串。例如：

文本：

ש1500;ום

，Windows-1255编码：

\x99\x8c\x85\x8d

，Unicode:

u'\u05e9\u05dc\u05d5\u05dd'

，错误的Unicode:

u'\x99\x8c\x85\x8d'

在解析MP3文件中的ID3标记时，我有时会碰到这样的字符串。如何修复这些字符串？（例如，将

u'\x99\x8c\x85\x8d'

转换为

u'\u05e9\u05dc\u05d5\u05dd'

）

您可以使用

u'\x99\x8c\x85\x8d'

拉丁编码将


In [9]: x = u'\x99\x8c\x85\x8d'

In [10]: x.encode('latin-1')
Out[10]: '\x99\x8c\x85\x8d'

但是，这似乎不是有效的Windows-1255编码字符串。您的意思可能是'\xf9\xec\xe5\xed'
？如果是，那么
In [22]: x = u'\xf9\xec\xe5\xed'

In [23]: x.encode('latin-1').decode('cp1255')
Out[23]: u'\u05e9\u05dc\u05d5\u05dd'

将u'\xf9\xec\xe5\xed'
转换为与您发布的所需unicode匹配的u'\u05e9\u05dc\u05d5\u05dd'


如果确实要将u'\x99\x8c\x85\x8d'
转换为u'\u05e9\u05dc\u05d5\u05dd'
，则此操作会发生：
In [27]: u'\x99\x8c\x85\x8d'.encode('latin-1').decode('cp862')
Out[27]: u'\u05e9\u05dc\u05d5\u05dd'


使用以下脚本找到上述编码/解码链：
guess\u chain\u encodings.py
"""
Usage example: guess_chain_encodings.py "u'баба'" "u'\xe1\xe0\xe1\xe0'"
"""
import six
import argparse
import binascii
import zlib
import utils_string as us
import ast
import collections
import itertools
import random

encodings = us.all_encodings()

Errors = (IOError, UnicodeEncodeError, UnicodeError, LookupError,
          TypeError, ValueError, binascii.Error, zlib.error)

def breadth_first_search(text, all = False):
    seen = set()
    tasks = collections.deque()
    tasks.append(([], text))
    while tasks:
        encs, text = tasks.popleft()
        for enc, newtext in candidates(text):
            if repr(newtext) not in seen:
                if not all:
                    seen.add(repr(newtext))
                newtask = encs+[enc], newtext
                tasks.append(newtask)
                yield newtask

def candidates(text):
    f = text.encode if isinstance(text, six.text_type) else text.decode
    results = []
    for enc in encodings:
        try:
            results.append((enc, f(enc)))
        except Errors as err:
            pass
    random.shuffle(results)
    for r in results:
        yield r

def fmt(encs, text):
    encode_decode = itertools.cycle(['encode', 'decode'])
    if not isinstance(text, six.text_type):
        next(encode_decode)
    chain = '.'.join( "{f}('{e}')".format(f = func, e = enc)
                     for enc, func in zip(encs, encode_decode) )
    return '{t!r}.{c}'.format(t = text, c = chain)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('start', type = ast.literal_eval, help = 'starting unicode')
    parser.add_argument('stop', type = ast.literal_eval, help = 'ending unicode')
    parser.add_argument('--all', '-a', action = 'store_true')    
    args = parser.parse_args()
    min_len = None
    for encs, text in breadth_first_search(args.start, args.all):
        if min_len is not None and len(encs) > min_len:
            break
        if type(text) == type(args.stop) and text == args.stop:
            print(fmt(encs, args.start))
            min_len = len(encs)

if __name__ == '__main__':
    main()

运行
% guess_chain_encodings.py "u'\x99\x8c\x85\x8d'" "u'\u05e9\u05dc\u05d5\u05dd'" --all

屈服
u'\x99\x8c\x85\x8d'.encode('latin_1').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('charmap').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('rot_13').decode('cp856')

等等。
哈哈，我从python的解释器中获取了这个值，结果是肯定的，即windows-1255
。哦，好吧。