Python: delete all files in a directory that contain non-UTF-8 symbols


python unicode python-3.x


I have a set of data, but I only need to work with utf-8 data, so I need to delete all data that contains non-utf-8 symbols.

When I try to work with these files, I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte

I get the error here:

yield [file_name, body.read()]

and here:

list_of_emails = mailsrch.findall(text)

But when I work with utf-8 data, everything works fine.

I suspect you want to use the errors='ignore' argument when decoding the bytes.

Edit:
The example below shows a good way to do this:
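# Fragment of a generator method; requires "import os" at module level.
for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        # errors='ignore' silently drops any bytes that are not valid UTF-8
        with open(fullpath, 'r', encoding='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]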

By using os.path.join, you can eliminate the add_slash method and make sure the code works across platforms.
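
The snippet above keeps every file and silently drops the bytes it cannot decode. If the goal is instead what the question title describes, removing files that are not valid UTF-8, one possible approach is to attempt a strict decode of each file and collect the ones that fail. The following is only a minimal sketch under that assumption. The function names find_non_utf8_files and delete_non_utf8_files are hypothetical and do not appear in the original code.

```python
import os


def find_non_utf8_files(path_to_dir):
    """Return sorted names of regular files in path_to_dir that are not valid UTF-8."""
    bad_files = []
    for file_name in sorted(os.listdir(path_to_dir)):
        fullpath = os.path.join(path_to_dir, file_name)
        if not os.path.isfile(fullpath):
            continue  # skip subdirectories and other non-file entries
        with open(fullpath, 'rb') as body:
            data = body.read()
        try:
            data.decode('utf-8')  # strict decoding raises on any invalid byte
        except UnicodeDecodeError:
            bad_files.append(file_name)
    return bad_files


def delete_non_utf8_files(path_to_dir):
    """Delete the files reported by find_non_utf8_files and return their names."""
    bad_files = find_non_utf8_files(path_to_dir)
    for file_name in bad_files:
        os.remove(os.path.join(path_to_dir, file_name))
    return bad_files
```

It is probably worth calling find_non_utf8_files first and reviewing the returned list before calling delete_non_utf8_files, since the deletion cannot be undone. Reading each file fully into memory is a simplification that suits small files such as emails.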
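All ASCII characters are also UTF-8 characters. Did you perhaps mean "non-ASCII"?
When I use other symbols in my program, I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
Can you include your code too, please?
So I should add .decode("utf-8", "ignore") to my line yield [file_name, body.read()], like this: yield [file_name, body.read().decode("utf-8", "ignore")]? By the way, I still get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte
No. Because you are using io.open, you would use its errors argument, e.g. io.open(file_name, 'r', encoding='utf-8', errors='ignore'). Also, note that building paths with os.path.join is generally a good idea. I would also suggest using the built-in open. I will update my answer to show examples of these.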
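
The comments above contrast calling .decode("utf-8", "ignore") on the data with passing errors='ignore' to open. Both rely on the same codec error handlers, which can be tried directly on a bytes object. The sketch below is only an illustration: the sample bytes are made up, with the invalid byte 0xc1 borrowed from the error message in the question.

```python
# Valid UTF-8 for "café", followed by a space, the invalid byte 0xc1, and " end".
raw = b"caf\xc3\xa9 \xc1 end"

try:
    raw.decode("utf-8")  # 'strict' is the default error handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" drops the invalid byte and keeps everything else.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD (the replacement character) instead.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Note that errors='ignore' discards data without any warning, so errors='replace' may be a better choice when it matters to see where a file was damaged.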