Python: stat millions of files as fast as possible

python performance file-io

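I want to do some "big data" analysis on a set of scientific data files. More specifically, I want to collect the atime, mtime and ctime of these files to work out an appropriate hierarchical storage management policy, so that we can plan our storage tiers effectively.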

The constraints are:

We have many Solaris systems (ZFS) and some Linux systems
We have over a billion files
We have roughly 100 storage servers

I am already set up to make the data collection "embarrassingly parallel".
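To make the parallel idea concrete, here is a minimal sketch of one way to split a single server's tree: each immediate subdirectory is handed to its own worker thread. The names `collect`, `collect_parallel` and `workers` are made up for illustration and are not part of my actual script. Threads are used on the assumption that the work is dominated by waiting on `lstat()` calls; whether this helps at all depends on the filesystem and on how evenly the subdirectories are sized.

```python
import os
import stat
from concurrent.futures import ThreadPoolExecutor


def _row(st):
    # Same columns as the script prints: size, ctime, mtime, atime.
    return (st.st_size, st.st_ctime, st.st_mtime, st.st_atime)


def collect(root):
    """Return one row per non-symlink file found below root."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISLNK(st.st_mode):
                rows.append(_row(st))
    return rows


def collect_parallel(top, workers=8):
    """Scan each immediate subdirectory of top in a separate worker thread."""
    subdirs, rows = [], []
    for name in os.listdir(top):
        path = os.path.join(top, name)
        st = os.lstat(path)
        if stat.S_ISDIR(st.st_mode):
            subdirs.append(path)
        elif not stat.S_ISLNK(st.st_mode):
            rows.append(_row(st))  # files sitting directly in top
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(collect, subdirs):
            rows.extend(part)
    return rows
```

For a billion files the rows would have to be streamed to disk rather than held in a list; the list just keeps the sketch short.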

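I looked into using the find command, but the Solaris version is rather limited (no -printf). So I decided to write a Python script to collect the data. It looks something like this: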

import os
import sys

def os_path_get(filename):
  # Each os.path.get* helper performs its own stat() call: four per file.
  return {
    'atime': os.path.getatime(filename),
    'mtime': os.path.getmtime(filename),
    'ctime': os.path.getctime(filename),
    'size': os.path.getsize(filename)
  }

def stat_get(filename):
  # A single lstat() call returns every field at once.
  st = os.lstat(filename)
  return {
    'atime': st.st_atime,
    'mtime': st.st_mtime,
    'ctime': st.st_ctime,
    'size': st.st_size,
    'inode': st.st_ino,
  }

# Root directory to scan; taken from the command line here, defaulting to '.'.
this = sys.argv[1] if len(sys.argv) > 1 else '.'

for root, directories, filenames in os.walk(this):
  for filename in filenames:
    path = os.path.join(root, filename)
    if not os.path.islink(path):
      # d = os_path_get(path)
      d = stat_get(path)
      print("%s\t%s\t%s\t%s" % (d['size'], d['ctime'], d['mtime'], d['atime']))

(I will probably also dump the now() time, or do the difference calculation in the output.)
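The difference calculation could look roughly like the sketch below, which turns each timestamp into an age in days relative to a single "now" value. The function name `ages_in_days` and the output keys are hypothetical; passing `now` in explicitly keeps every file in one run measured against the same reference time.

```python
import os
import time

SECONDS_PER_DAY = 86400.0


def ages_in_days(path, now=None):
    """Return the size of path plus the age of each timestamp, in days."""
    if now is None:
        now = time.time()
    st = os.lstat(path)  # one system call, does not follow symlinks
    return {
        'size': st.st_size,
        'atime_days': (now - st.st_atime) / SECONDS_PER_DAY,
        'mtime_days': (now - st.st_mtime) / SECONDS_PER_DAY,
        'ctime_days': (now - st.st_ctime) / SECONDS_PER_DAY,
    }
```

Calling `time.time()` once before the walk and passing it to every call avoids both the extra call per file and a slowly drifting reference over a multi-hour run.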

However, on a test server with 52 million files, a run using os_path_get() took about 70 hours.

My unscientific testing of stat_get() against os_path_get() suggests that the latter may be about 50% slower.
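A slightly more repeatable way to compare the two approaches is sketched below. `four_calls`, `one_call` and `time_calls` are illustrative names only. Two caveats: the `os.path.get*` helpers follow symlinks while `os.lstat()` does not, and repeated passes over the same paths mostly measure a warm metadata cache rather than a cold scan of a real server.

```python
import os
import time


def four_calls(path):
    """Fetch the fields with the os.path helpers: one stat() per helper."""
    return (os.path.getsize(path), os.path.getctime(path),
            os.path.getmtime(path), os.path.getatime(path))


def one_call(path):
    """Fetch the same fields from a single lstat() result."""
    st = os.lstat(path)
    return (st.st_size, st.st_ctime, st.st_mtime, st.st_atime)


def time_calls(func, paths, repeat=3):
    """Return the best wall-clock time, in seconds, for func over all paths."""
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        for p in paths:
            func(p)
        best = min(best, time.perf_counter() - start)
    return best
```

Running `time_calls(four_calls, paths)` and `time_calls(one_call, paths)` on the same sample list gives a rough per-call ratio; it says nothing about directory traversal cost, which is timed separately.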

Any suggestions on how to speed up the metadata collection? (multithreading ... os.scandir() ...)
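For reference, this is roughly what I imagine an os.scandir() version would look like. It is an untested-at-scale sketch, assuming Python 3.6 or later (where `os.scandir()` can be used as a context manager); `scan_tree` is a made-up name. The hoped-for saving is that `entry.is_symlink()` and `entry.is_dir()` can usually be answered from the directory listing itself, leaving one `lstat()` per file instead of the `islink()` check plus `lstat()` in my script. I do not know how much of that holds on Solaris/ZFS.

```python
import os


def scan_tree(top):
    """Yield (size, ctime, mtime, atime, inode) for each non-symlink file below top."""
    stack = [top]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_symlink():
                            continue  # skip symlinks, as the original script does
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)  # descend later
                            continue
                        st = entry.stat(follow_symlinks=False)
                    except OSError:
                        continue  # entry vanished or is unreadable
                    yield (st.st_size, st.st_ctime, st.st_mtime,
                           st.st_atime, st.st_ino)
        except OSError:
            continue  # directory vanished or is unreadable
```

Because it is a generator, the rows can be printed or written out as they arrive instead of being accumulated in memory.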