What is the default encoding in Python 2.7.8?
When I open a file with codecs.open('f.txt', 'r', encoding=None), Python 2.7.8 chooses some default encoding. Which one is it, and where is this documented?
It appears that the default encoding is none of the following:
utf-8
ascii
sys.getdefaultencoding()
locale.getpreferredencoding()
locale.getpreferredencoding(False)
Edit (to clarify my motivation): I want to know which encoding Python 2.7.8 chooses when I run a script like this:
import codecs
f = codecs.open('f.txt', 'r', encoding=None)  # or equivalently: f = open('f.txt')
for line in f:
    print len(line)  # obviously SOME encoding has been chosen if I can print the number of characters
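I am not interested in other ways of guessing the file's encoding.
It basically does not do any transparent encoding/decoding; it just opens the file and returns it. Here is the code from the library: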
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.

        Note: The wrapped version will only accept the object format
        defined by the codecs, i.e. Unicode objects for most builtin
        codecs. Output is also codec dependent and will usually be
        Unicode as well.

        Files are always opened in binary mode, even if no binary mode
        was specified. This is done to avoid data loss due to encodings
        using 8-bit values. The default file mode is 'rb' meaning to
        open the file in binary read mode.

        encoding specifies the encoding which is to be used for the
        file.

        errors may be given to define the error handling. It defaults
        to 'strict' which causes ValueErrors to be raised in case an
        encoding error occurs.

        buffering has the same meaning as for the builtin open() API.
        It defaults to line buffered.

        The returned wrapped file object provides an extra attribute
        .encoding which allows querying the used encoding. This
        attribute is only available if an encoding was specified as
        parameter.
    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw
As you can see, if encoding is None, it simply returns the opened file.
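The mode handling in the quoted source can be isolated to see which mode string the file is actually opened with. This is only a minimal sketch: effective_mode is a hypothetical helper, not part of the codecs module, and it copies just the mode-rewriting branch from the Python 2.7 source quoted above.

```python
def effective_mode(mode='rb', encoding=None):
    """Hypothetical helper: return the mode string that the quoted
    Python 2.7 codecs.open would pass to the built-in open()."""
    if encoding is not None:
        if 'U' in mode:
            # Universal-newline flag is dropped when an encoding is given
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Binary mode is forced only on this branch
            mode = mode + 'b'
    return mode


for args in [('r', None), ('r', 'ascii'), ('rU', 'utf-8')]:
    print("mode=%r encoding=%r -> %r" % (args[0], args[1], effective_mode(*args)))
```

Under this reading, 'b' is appended only when an encoding is supplied, so with encoding=None the mode 'r' is passed through unchanged.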
Here is your file, with each byte shown in decimal next to its corresponding ASCII character:
46 .
46 .
46 .
32 'space'
48 0
45 -
49 1
10 'line feed'
10 'line feed'
91 [
69 E
118 v
101 e
110 n
116 t
32 'space'
34 "
72 H
97 a
114 r
118 v
97 a
114 r
100 d
32 'space'
67 C
117 u
112 p
32 'space'
51 3
48 0
180 'this is not ascii'
34 "
93 ]
10 'line feed'
46 .
46 .
46 .
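A listing like the one above can be produced with a few lines of code. The following is a rough sketch rather than the script actually used for this answer; describe_byte and dump_bytes are hypothetical names, and the sample bytes are taken from the end of the [Event ...] line shown above.

```python
def describe_byte(value):
    """Return a printable label for one byte value (0-255)."""
    if value == 10:
        return "'line feed'"
    if value == 32:
        return "'space'"
    if 33 <= value <= 126:
        return chr(value)  # printable ASCII
    if value >= 128:
        return "'this is not ascii'"
    return "'control character'"


def dump_bytes(data):
    """Return (decimal value, label) pairs for every byte in data."""
    # bytearray yields integers on both Python 2 and Python 3
    return [(value, describe_byte(value)) for value in bytearray(data)]


sample = b'30\xb4"]\n'
for value, label in dump_bytes(sample):
    print("%d %s" % (value, label))
```

To dump a real file, read it in binary mode (for example open('f.txt', 'rb').read()) and pass the result to dump_bytes.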
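The problem you hit when opening it as ASCII is the byte with decimal value 180. ASCII only goes up to 127. That made me think this must be some kind of extended ASCII, where 128-255 are used for extra symbols. After reading through the Wikipedia article on ASCII, I found that it mentions a popular ASCII extension called windows-1252. In windows-1252, decimal value 180 maps to the acute accent character (´). I then decided to Google the string in your file to see what it actually refers to. That is how I found "Harvard Cup 30´".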
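The reasoning about byte 180 can be checked directly by decoding that single byte with a few candidate codecs. This is an illustrative sketch; the list of candidates is my own choice, and latin-1 and utf-8 are included only for comparison.

```python
suspect = b"\xb4"  # decimal 180, the non-ASCII byte from the file

results = {}
for name in ("ascii", "windows-1252", "latin-1", "utf-8"):
    try:
        results[name] = suspect.decode(name)
    except UnicodeDecodeError:
        results[name] = None  # this codec cannot decode the byte on its own

for name in sorted(results):
    print("%-12s %r" % (name, results[name]))
```

Note that latin-1 also maps this byte to U+00B4, so the byte alone does not distinguish windows-1252 from latin-1; windows-1252 remains a plausible guess rather than a certainty.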
In summary, the correct encoding is probably windows-1252. Here is my test program:
import codecs
with codecs.open('f.txt', 'r', encoding='windows-1252') as f:
    print f.read()
Output:
... 0-1
[Event "Harvard Cup 30´"]
...
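When a file is read using codecs.open('f.txt', 'r', encoding=None), a byte string is returned instead of a Unicode string. It does not attempt to decode the file data with any encoding at all. It is equivalent to open('f.txt', 'r'). The length you receive is the number of individual bytes in the line as stored in the file, with no translation.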
A small example:
>>> import codecs
>>> codecs.open('f.txt','r',encoding=None).read()
'abc\n'
>>> codecs.open('f.txt','r',encoding='ascii').read() # Note Unicode string returned.
u'abc\r\n'
>>> open('f.txt','r').read()
'abc\n'
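The difference between counting bytes and counting characters only becomes visible with a multi-byte encoding. The sketch below is an illustration under assumptions, not the asker's setup: it writes a throwaway file encoded as UTF-8 (chosen only because it uses two bytes for ´) and uses io.open so that it behaves the same on Python 2.7 and Python 3.

```python
import io
import os
import tempfile

text = u"30\u00b4\n"  # '3', '0', acute accent, line feed

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with io.open(path, "wb") as f:
        f.write(text.encode("utf-8"))  # the accent becomes two bytes: 0xC2 0xB4

    with io.open(path, "rb") as f:
        raw = f.read()  # no decoding: a byte string
    with io.open(path, "r", encoding="utf-8") as f:
        decoded = f.read()  # decoded: a Unicode string
finally:
    os.remove(path)

print("bytes read without decoding: %d" % len(raw))
print("characters after decoding:   %d" % len(decoded))
```

Without decoding, len() reports 5 because it counts stored bytes; after decoding it reports 4 because it counts characters.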
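Comments:
Python's default encoding is ASCII.
So how do we explain this? What encoding is chosen when I iterate over the lines in the returned file?
E.g. if you use f = open('f.txt', 'r') and then print f.encoding.
Have you tried chardet?
Not yet. It seems useful, but I just want to reproduce the behavior of open('f.txt', 'r'); for line in f: pass
print(open('test.txt', 'r').encoding) prints None.
Actually, despite the Python documentation, I believe codecs.open('f.txt', 'r', encoding=None) is equivalent to open('f.txt', 'r'), not open('f.txt', 'rb'). It only adds 'b' when an encoding is specified. See the library code in my answer.
@StephenBriney, you are right. I will update. The evidence is also in the first codecs.open line of my example: it returns only \n, not \r\n, which indicates text mode rather than binary mode. I did not notice that when I posted. I think the documentation says "Files are always opened in binary mode, even if no binary mode was specified."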