How can I detect whether a file is binary (non-text) in Python?


python file binary


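In Python, how can I tell whether a file is binary (non-text)?

I am searching through a large set of files in Python and keep getting matches in binary files, which makes the output look very cluttered.

I know I could use grep -I, but I am doing more with the data than grep allows.

In the past I would simply have searched for characters greater than 0x7f, but UTF-8 and similar encodings make that impossible on modern systems. Ideally the solution would be fast.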

Are you on Unix? If so, try:

import os
isBinary = os.system("file -b " + name + " | grep text > /dev/null")

The shell's return value is inverted (0 means success), so if grep finds "text" the call returns 0, which Python treats as false.

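Usually you have to guess.

If the files have extensions, you can treat those as a clue.

You can also recognize known binary formats and ignore them.

Otherwise, look at the proportion of non-printable ASCII bytes you have and guess from that.

You can also try decoding the data as UTF-8 and see whether that produces sensible output.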
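The guessing steps above can be sketched roughly as follows. This is only an illustration: the name `looks_binary`, the 30% threshold, and the set of control characters treated as harmless are assumptions made for the example, not values given in the answer.

```python
import codecs

# Control characters that commonly appear in ordinary text (assumed set).
_ALLOWED_CONTROLS = "\n\r\t\f\b"


def looks_binary(data, threshold=0.30):
    """Guess whether a bytes sample is binary (hypothetical helper)."""
    if not data:
        return False  # treat an empty sample as text
    if b"\x00" in data:
        return True  # NUL bytes are rare in ordinary text
    try:
        # final=False tolerates a multi-byte character cut off at the end
        # of the sample, but still rejects invalid byte sequences.
        decoder = codecs.getincrementaldecoder("utf-8")()
        text = decoder.decode(data, final=False)
    except UnicodeDecodeError:
        # Not UTF-8: count bytes outside printable ASCII and common whitespace.
        printable = bytes(range(32, 127)) + _ALLOWED_CONTROLS.encode("ascii")
        suspicious = len(data.translate(None, printable))
    else:
        # Valid UTF-8: count control characters that text rarely contains.
        suspicious = sum(
            1 for ch in text if ord(ch) < 32 and ch not in _ALLOWED_CONTROLS
        )
    return suspicious / len(data) > threshold
```

In practice you would probably pass it only the first few kilobytes of each file, for example `looks_binary(open(path, "rb").read(8192))`, so that large files stay cheap to check.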

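You can also use a module for this:

Compiling a list of binary MIME types is fairly easy. For example, Apache distributes a mime.types file that you can parse into a list of binary types and a list of text types, and then check whether a file's MIME type is in your text list or your binary list.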
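As a rough sketch of the MIME-type approach, the standard library's `mimetypes.guess_type` maps a file name to a MIME type based on its extension. The helper name `guess_is_text` and the small allow-list of textual `application/` types below are assumptions for illustration only.

```python
import mimetypes

# Hypothetical allow-list: MIME types outside "text/" that still hold plain text.
TEXTUAL_APPLICATION_TYPES = {"application/json", "application/xml"}


def guess_is_text(filename):
    """Return True or False based on the extension, or None if it is unknown."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None  # no opinion: the extension is not in the MIME table
    return mime.startswith("text/") or mime in TEXTUAL_APPLICATION_TYPES
```

This looks only at the file name, never at the contents, so it is fast but easy to fool. A mime.types file such as Apache's could presumably be loaded with `mimetypes.read_mime_types` to extend the table.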

If it helps, many binary formats begin with a magic number, and lists of file signatures are available.
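A magic-number check could look something like the sketch below. The table holds only a handful of well-known signatures, and the names `MAGIC_NUMBERS`, `match_signature`, and `known_binary_format` are made up for this example.

```python
# A few well-known file signatures; a real list would be much longer.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\xff\xd8\xff": "JPEG image",
    b"\x1f\x8b": "gzip data",
}


def match_signature(header):
    """Return the name of the first signature that header starts with, else None."""
    for signature, name in MAGIC_NUMBERS.items():
        if header.startswith(signature):
            return name
    return None


def known_binary_format(path):
    """Read the first bytes of a file and look them up in MAGIC_NUMBERS."""
    with open(path, "rb") as f:
        return match_signature(f.read(16))
```

A match says the file is a known binary format; no match says nothing either way, so this works best as a quick first filter before a content-based heuristic.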

Here is a suggestion that uses the Unix file command:

Usage example:

>>> istext('/etc/motd')
True
>>> istext('/vmlinuz')
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True


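Its drawbacks are that it is not portable to Windows (unless you have something like the file command there), and that it has to spawn an external process for every file, which may not be desirable.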
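A minimal sketch of how an `istext` helper might wrap the file command, assuming a Unix-like system with `file` on the PATH. The parsing helper `description_is_text` and the choice of the `-b` (brief) and `-L` (follow symlinks) options are assumptions made for this illustration, not necessarily what the answer used.

```python
import re
import subprocess


def description_is_text(description):
    """Return True if a `file` description contains the word 'text'."""
    return re.search(r"\btext\b", description) is not None


def istext(path):
    """Ask the external `file` command whether path looks like text."""
    result = subprocess.run(
        ["file", "-b", "-L", path],  # argument list, so no shell quoting issues
        capture_output=True,
        text=True,
        check=True,
    )
    return description_is_text(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, but the cost of one external process per file remains.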

If you are not on Windows, you can use a file-type detection library to determine the file type. You can then check whether it is a text/ MIME type.

Try the following:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found null byte (bytes literal: the file is opened in 'rb' mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # done
    # Note: Python does not need an "except:" clause here; try/finally is enough.
    finally:
        fin.close()

    return False


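I think the best solution is to use the guess_type function. It holds a list of MIME types, and you can also add your own types. Here is the script I wrote to solve my problem: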

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir

def __init__(self):
        self.__addMimeTypes()

def __addMimeTypes(self):
        # Register extensions that should be treated as plain text
        add_type("text/plain",".properties")

def __listDir(self,path):
        try:
            return listdir(path)
        except OSError:
            print ("The directory {0} could not be accessed".format(path))
            return []

def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            mime = guess_type(files)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                asciiFiles.append(files)
        return asciiFiles

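It lives inside a class, as you can see from the structure of the code, but you can adapt it to however you want to implement it in your application. It is simple to use: the getTextFiles method returns a list containing all the text files in the directory passed in the path variable.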

There is one more method:

For example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False
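One way a helper like the `is_binary_string` used above could be written is sketched below. The particular set of bytes treated as "text" (a few control characters plus everything from 0x20 upward except DEL) is an assumption for this illustration.

```python
# Bytes assumed to be plausible in text: BEL, BS, TAB, LF, FF, CR, ESC,
# plus 0x20-0xFF except DEL (0x7F). High bytes are allowed so that UTF-8
# and legacy 8-bit encodings are not flagged.
TEXT_BYTES = bytes({7, 8, 9, 10, 12, 13, 27} | (set(range(0x20, 0x100)) - {0x7F}))


def is_binary_string(data):
    """Return True if data contains any byte outside TEXT_BYTES."""
    # translate(None, delete) removes every byte listed in `delete`;
    # anything left over is a byte we do not expect in text.
    return bool(data.translate(None, TEXT_BYTES))
```

Because a single unexpected byte is enough to flag the sample, this is stricter than a ratio-based heuristic, and it is normally applied to just the first kilobyte or so of a file.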


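I came here looking for exactly the same thing: a comprehensive solution, provided by the standard library, for detecting binary versus text. After reviewing the options people suggested, the *nix file command looks like the best choice (I only develop for Linux boxes). Some others have posted solutions that use file, but in my opinion they are unnecessarily complicated, so here is what I came up with: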

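import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so file names with quotes or spaces are safe
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output returns bytes on Python 3, so compare against bytes prefixes
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True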

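It goes without saying, but the code that calls this function should make sure the file is readable before testing it; otherwise the file will be wrongly detected as binary.

A shorter solution, with a UTF-16 caveat:

def is_binary(filename):
    """
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

Use a library.
It is very simple and is based on the code in this Stack Overflow question.

You could actually write this in two lines of code, but the package saves you from having to write and thoroughly test those two lines against all sorts of odd file types, across platforms.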
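
If you are using Python 3 with UTF-8, this is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses Unicode when handling files in text mode (and bytearray in binary mode), so if your encoding cannot decode arbitrary files, you are very likely to get a UnicodeDecodeError.

Example: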


try:
    with open(filename, "r") as f:
        for l in f:
             process_line(l)
except UnicodeDecodeError:
    pass # Found non-text data

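Most programs consider a file to be binary (that is, any file that is not "line-oriented") if it contains a NUL character.

Here is Perl's version of pp_fttext() (pp_sys.c) implemented in Python: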

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

b'\x00' in open("foo.bar", 'rb').read()

#!/usr/bin/env python3

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

import codecs

#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)

def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

from os.path import realpath
from subprocess import check_output
from shlex import split

filepath = realpath('rel/or/abs/path/to/file')
# Decode and lowercase the output of `file` (not the command line) before matching
assert 'ascii' in check_output(split('file {}'.format(filepath))).decode().lower()

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert 'ascii' in check_output(split('file {}'.format(afile))).decode().lower()

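# os.walk yields (dirpath, dirnames, filenames) tuples and cannot be indexed directly
for curdir, _subdirs, filelist in os.walk('.'):
    for afile in filelist:
        assert 'ascii' in check_output(split('file {}'.format(os.path.join(curdir, afile)))).decode().lower()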

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # if decoding fails then file is non-text (binary)
        return True

>>> import magic
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'

from binaryornot.check import is_binary
is_binary('filename')