How to decode 'xmlcharrefreplace' ASCII to UTF-8 in Python 3?
My Python 2 code (PSP) generates the following:

"python - питон, прорицатель",
"cobra - кобра, очковая змея",
I am trying to read this data with Python 3, where the default encoding is UTF-8:
"python - питон, прорицатель",
"cobra - кобра, очковая змея",
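I need to count the Russian characters, but what gets counted instead are the special symbols "&", "#", ";" and the digits.
How do I decode the 'xmlcharrefreplace' ASCII into UTF-8 so that I can compare it with the Russian characters hard-coded in my Python 3 (UTF-8) code?
My stdout looks like: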
utf-8
UTF-8
{'russian': {}, 'other': {'~': 3, '=': 169, '<': 300, '?': 473, '>': 312, ';': 318392, ':': 222, '%': 29, "'": 31, '&': 318409, '!': 36, ' ': 51427, '#': 318390, '"': 320, '-': 9822, ',': 21578, '/': 843, '.': 800, ')': 527, '(': 526, '+': 2, ']': 8, '_': 117, '[': 8, '|': 1, '\r': 224, '\n': 224, '\t': 38, '`': 3, '5': 31451, '4': 23216, '7': 131141, '6': 40036, '1': 352560, '0': 373246, '3': 25196, '2': 37785, '9': 81825, '8': 177608, 'u': 3354, 't': 7281, 'w': 1179, 'v': 1074, 'q': 214, 'p': 2966, 's': 5816, 'r': 6948, 'y': 1714, 'x': 318, 'z': 222, 'e': 10841, 'd': 2918, 'g': 1996, 'f': 1801, 'a': 7069, 'c': 4020, 'b': 1805, 'm': 2337, 'l': 4821, 'o': 5906, 'n': 6307, 'i': 8068, 'h': 2559, 'k': 902, 'j': 142}}
Use html.unescape():

import html

with open(filename) as fileobj:
    for line in fileobj:
        # Turn numeric character references such as &#1087; back into text.
        # HTMLParser().unescape() did the same job but was removed in Python 3.9.
        line = html.unescape(line)

Demo:

>>> import html
>>> html.unescape(' "python - &#1087;&#1080;&#1090;&#1086;&#1085;, &#1087;&#1088;&#1086;&#1088;&#1080;&#1094;&#1072;&#1090;&#1077;&#1083;&#1100;",')
' "python - питон, прорицатель",'

I'd use collections.Counter() objects to count the characters:

import html
from collections import Counter

ru_abc = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')  # lowercase letters only
stat_data = {'other': Counter(), 'russian': Counter()}

with open(filename) as fileobj:
    for line in fileobj:
        line = html.unescape(line)  # decode the references before counting
        stat_data['russian'].update(c for c in line if c in ru_abc)
        stat_data['other'].update(c for c in line if c not in ru_abc)

Result:

{
    'other': Counter({' ': 23, ',': 4, '"': 4, '\n': 3, 'o': 2, '-': 2, 'y': 1, 't': 1, 'b': 1, 'r': 1, 'p': 1, 'n': 1, 'h': 1, 'c': 1, 'a': 1}),
    'russian': Counter({'о': 5, 'а': 3, 'р': 3, 'п': 2, 'к': 2, 'и': 2, 'е': 2, 'я': 2, 'т': 2, 'н': 1, 'м': 1, 'л': 1, 'з': 1, 'в': 1, 'б': 1, 'ь': 1, 'ч': 1, 'ц': 1})
}
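The decoding step can be checked in isolation with a small round trip. This is only a sketch: it assumes the Python 2 side effectively encoded the text with the 'xmlcharrefreplace' error handler, and the variable names (original, escaped, restored) are made up for illustration.

```python
import html

# Sample text taken from the question.
original = 'питон, прорицатель'

# Roughly what the Python 2 side is assumed to have done: every non-ASCII
# character becomes a numeric character reference such as &#1087;.
escaped = original.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# html.unescape() turns the references back into the characters they name.
restored = html.unescape(escaped)

print(escaped)               # pure ASCII, full of '&', '#', ';' and digits
print(restored == original)  # the round trip should give back the input
```

This also suggests why the counts in the question were dominated by "&", "#", ";" and digits: without the unescape step, those are the only characters actually present in the file.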
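The counting logic above can also be wrapped in a function so it is easy to try on a few lines without a file. This is a rough sketch, not the answer's exact code: count_chars and RU_ABC are hypothetical names, and folding uppercase letters into their lowercase form is an assumption about what the counts should mean.

```python
import html
from collections import Counter

# Lowercase Russian alphabet, as in the answer above.
RU_ABC = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')


def count_chars(lines):
    """Return (russian, other) Counters for an iterable of escaped lines."""
    russian, other = Counter(), Counter()
    for line in lines:
        # Decode &#NNNN; references before classifying characters.
        for ch in html.unescape(line):
            # Assumption: uppercase letters count together with lowercase ones.
            if ch.lower() in RU_ABC:
                russian[ch.lower()] += 1
            else:
                other[ch] += 1
    return russian, other


# One escaped line in the same shape as the data from the question.
sample = ['"cobra - &#1082;&#1086;&#1073;&#1088;&#1072;",\n']
russian, other = count_chars(sample)
print(russian)
print(other)
```

Because the references are decoded first, "&", "#" and ";" should no longer show up in the "other" counter unless they appear literally in the text.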