Drawbacks of using encode('utf-8') on strings read from Excel in Python

python utf-8 character-encoding


I am reading a large amount of data from an Excel spreadsheet, using the following general structure to read it (and to reformat and rewrite it):

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
     z = i + 1
     toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
     out.write(toprint)
     out.write("\n")

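Here x and y are arbitrary cells, with x being less arbitrary and containing utf-8 characters.

So far I have only used .encode('utf-8') on cells where I know there will be errors, or foresee errors, if I don't use utf-8.

My question is basically this: is there a downside to using .encode('utf-8') on all cells, even where it isn't necessary? Efficiency is not an issue. The main concern is that it keeps working even if a utf-8 character shows up in a place where there shouldn't be one. If no errors would occur when I just lump ".encode('utf-8')" onto every cell I read, I will probably end up doing that.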

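The xlrd documentation states clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, they contain Unicode code points anyway. It is therefore necessary to keep the contents of these cells as Unicode within Python and not convert them to ASCII (which is what the str() function does). Use the following code: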

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
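
For readers on Python 3, where unicode() no longer exists, the same idea (keep everything as text and encode once, at write time) might look roughly like the sketch below. It is only an illustration under stated assumptions: the rows are simulated with a plain list rather than read through xlrd, and the helper names cell_to_text and format_row are made up for this example.

```python
import io

def cell_to_text(value):
    """Return a text (str) version of a cell value; no encoding happens here."""
    # Numeric cells are simulated as floats and have no .encode() method,
    # so they are converted to text first.
    return str(value)

def format_row(left, right):
    """Build one output line as text from two cell values (hypothetical layout)."""
    return ("important stuff is to the right -> " + cell_to_text(left)
            + " more formatting! " + cell_to_text(right) + " and done\n")

# Simulated cell values (assumption): one numeric cell and one text cell per row.
rows = [(1.0, "plain ascii"), (2.5, "caf\u00e9 \u0400")]

# Encode exactly once, at the point where text becomes bytes.
buffer = io.BytesIO()
for left, right in rows:
    buffer.write(format_row(left, right).encode("utf-8"))
output = buffer.getvalue()
```

Encoding text that happens to be pure ASCII as UTF-8 yields the same bytes as ASCII would, so encoding every line at write time costs nothing for cells without special characters. Calling .encode() directly on a numeric cell value, on the other hand, fails because floats have no such method.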

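This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding a horizontal scroll bar increases the chance that people will read your code. Try wrapping your lines, for example: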

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

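(2) You are presumably using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats: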

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
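
The precision point can also be checked on Python 3. The following is a minimal sketch, not part of the original answer: it assumes Python 3, where str() on a float no longer truncates, so format(value, ".12g") is used to mimic the 12-digit behaviour of Python 2's str() and unicode(). The variable names are chosen for illustration only.

```python
def full_precision(x):
    """Text form of x; floats go through repr() so the value survives a round trip."""
    return repr(x) if isinstance(x, float) else str(x)

value = 123456.78901234567

# Mimics what Python 2's str()/unicode() did: 12 significant digits.
short = format(value, ".12g")

# repr() keeps enough digits to reconstruct the exact float.
full = full_precision(value)

# Converting the strings back shows which one lost information.
round_trip_short = float(short) == value
round_trip_full = float(full) == value
```

Here round_trip_short is False and round_trip_full is True, which is the practical difference between the two conversions when the numbers are re-read later.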

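(3) xlrd builds Cell objects on the fly when they are demanded, so going through cell() is slower than asking for the value directly: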

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster

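This is a great answer! Thanks very much. However, I just realized there is a second part to my question. I am then going to upload the output of this into a SQL table. Will SQL support the output of the modified code? And what does str(sheettwo.cell(z,x).value.encode('utf-8')) do to a utf-8 string that contains only unicode characters?

@MichaelKlocker: I'm afraid this solution does not work on Windows: codecs.open() opens files in binary mode, so \n is not converted to the Windows newline sequence. The simplest solution to this problem seems to be not to use codecs at all, but to encode the text by hand when calling write().

@EOL - Thanks for the hint. I wasn't aware of this.

@logan - It seems you really just want to get the data out of the Excel spreadsheet and into a database. Is it even necessary to write Python code for that? Saving the file as CSV might do the trick.

Hi Logan, the command str(sheettwo.cell(z,x).value.encode('utf-8')) ... will fail if this cell contains Unicode characters. The reason is simple. Unicode really just gives you the code points; how these code points are written to disk depends on the encoding. Now, if you take a Unicode code point above ASCII character 127 and try to force it into ASCII (by using the str() method), Python will raise an exception to prevent data loss. You seem to be confusing Unicode with UTF-8. For more information, please read: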