Reading/writing files with umlauts in Python (HTML to txt)

python utf-8

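I know this has been asked several times, but I think I am doing everything right and it still doesn't work, so before I go clinically insane I'll post a question. Here is the code (it is supposed to convert HTML files to txt files and leave out certain lines):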

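fid = codecs.open(htmlFile, "r", encoding="utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()
stripped = strip_tags(unicode(htmlText))  # strip html tags (this is not the problem)
lines = stripped.split("\n")
out = []
for line in lines:  # just some stuff I want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '@' in line or ':' in line:
        continue
    out.append(line)
result = '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'
fid = codecs.open(outfile, "w", encoding='utf-8')
fid.write(result)
fid.close()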

Thanks!

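Not quite sure, but by calling '\n'.join(out) with a non-unicode string (a plain old byte string), you might be falling back to some non-UTF-8 codec. Try:

u'\n'.join(out)

to make sure you are using unicode objects everywhere.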

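You haven't specified the problem, so this is a complete guess.

What does your strip_tags() function return? Is it a unicode object or a byte string? If it is the latter, it will likely cause decoding problems when you try to write it to the file. For example, if strip_tags() returns a UTF-8 encoded byte string, the writer has to decode it implicitly with the default ascii codec before it can re-encode it, and that fails on the first non-ASCII byte: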

>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.

>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s)    # no problem with this... s is unicode, but
>>> f.write(s_utf8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
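
One way to guard against the failure shown above is to normalize every value to text before it reaches the writer. The following is only a minimal sketch, written in Python 3 syntax (where `str` plays the role of Python 2's `unicode` and `bytes` that of the byte string). The helper name `to_text` is hypothetical, and the sketch assumes that any byte strings are UTF-8 encoded:

```python
import io
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


s = u"This is \xe4 test\nHere is \xe4nother line."
s_utf8 = s.encode("utf-8")  # what a byte-returning strip_tags() might hand back

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(to_text(s))       # already text, passed through unchanged
    f.write(u"\n")
    f.write(to_text(s_utf8))  # decoded to text before it is written

# Read the file back to confirm that both writes produced the same text.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Decoding at the boundary like this keeps the writer from ever having to guess an encoding, which is where the implicit ascii step in the traceback comes from.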

Also, data read from a file opened with codecs.open() is automatically converted to unicode, so calling unicode(htmlText) achieves nothing.
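
Taken together, the advice so far amounts to: decode once when reading, keep everything as text in between, and encode once when writing. Here is a rough sketch of the whole conversion under that approach, in Python 3 syntax. The `strip_tags` below is a hypothetical stand-in built on the standard `html.parser` module, not the asker's function, and `html_to_txt` is likewise a made-up name:

```python
import io
import os
import tempfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags; a stand-in for the asker's strip_tags()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return u"".join(parser.chunks)


def html_to_txt(html_file):
    # Decode once on the way in: read() returns text, not bytes.
    with io.open(html_file, "r", encoding="utf-8") as fid:
        html_text = fid.read()

    out = []
    for line in strip_tags(html_text).split(u"\n"):
        if len(line) < 6:
            continue
        if any(ch in line for ch in u"*(@:"):
            continue
        out.append(line)

    # Encode once on the way out.
    base, _ext = os.path.splitext(html_file)
    outfile = base + ".txt"
    with io.open(outfile, "w", encoding="utf-8") as fid:
        fid.write(u"\n".join(out))
    return outfile


# Usage: build a small sample page in a temporary directory and convert it.
sample_dir = tempfile.mkdtemp()
sample_html = os.path.join(sample_dir, "page.html")
with io.open(sample_html, "w", encoding="utf-8") as f:
    f.write(u"<p>Gr\xfc\xdfe aus K\xf6ln</p>\n<p>a@b.de</p>\n"
            u"<p>kurz</p>\n<p>Sch\xf6ne \xc4pfel</p>\n")
sample_txt = html_to_txt(sample_html)
with io.open(sample_txt, "r", encoding="utf-8") as f:
    converted = f.read()
```

The filtering rules are copied from the question; only the file handling and the tag stripping differ.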

Can you state what the actual problem is?

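# codecs.open() raises IOError on failure rather than returning a false value,
# so catch the exception instead of testing "if not fid".
try:
    with codecs.open(htmlFile, "r", encoding="utf-8") as fid:
        htmlText = fid.read()
except IOError as e:
    # handle error
    print e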
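
For readers on Python 3, the same try/except pattern looks slightly different: `IOError` is an alias of `OSError` there, and `print` is a function. A small sketch follows; the helper name `read_html` and its choice to return `None` on failure are assumptions made for illustration, not part of the original code:

```python
import io
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Return the decoded file contents, or None if the file cannot be opened."""
    try:
        with io.open(path, "r", encoding=encoding) as fid:
            return fid.read()
    except IOError as e:  # the same class as OSError on Python 3
        print(e)
        return None


# Usage: a path inside a fresh temporary directory is guaranteed not to exist.
missing_path = os.path.join(tempfile.mkdtemp(), "missing.html")
missing = read_html(missing_path)
```

Note that a file with the wrong encoding raises `UnicodeDecodeError`, which is not an `IOError`, so this sketch deliberately lets that one propagate.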