Python: hashing multiple files. Problem statement:


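Given a directory, I want to traverse it and its non-hidden subdirectories,
and insert a hash into the names of the non-hidden files.
If the script is re-run, it should replace the old hash with the new one.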

name.ext ==> name.hash.ext

name.oldhash.ext ==> name.newhash.ext


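Questions: a) How would you do it? b) Of all the approaches available, what makes yours the best fit?
Verdict: Thanks everyone. I chose SeigeX's answer for its speed and portability.
It is faster than the other bash variants,
and it ran on my Mac OS X machine without any changes.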


You may want to store the results in a single file, like this:

find . -type f -exec md5sum {} \; > MD5SUMS
If you really want one file per hash:

find . -type f | while IFS= read -r f; do md5sum "$f" > "$f.md5"; done
or even:

find . -type f | while IFS= read -r f; do g=$(md5sum "$f" | awk '{print $1}'); echo "$g $f" > "$f-$g.md5"; done
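The single-manifest idea can also be written without starting one md5sum process per file. This is a minimal Python sketch, assuming the usual md5sum line layout (digest, two spaces, path). The function name `write_manifest` is made up for illustration and is not part of any answer here.

```python
import hashlib
import os


def write_manifest(root, manifest="MD5SUMS"):
    """Write one 'digest  path' line per regular file under root.

    Returns the number of files that were hashed.
    """
    manifest_abs = os.path.abspath(manifest)
    lines = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            if os.path.abspath(path) == manifest_abs:
                continue  # do not hash a manifest left over from an earlier run
            with open(path, "rb") as fh:  # binary mode, like md5sum
                digest = hashlib.md5(fh.read()).hexdigest()
            lines.append("%s  %s\n" % (digest, path))
    with open(manifest, "w") as out:
        out.writelines(lines)
    return len(lines)
```

Calling `write_manifest(".")` would roughly correspond to the `find ... > MD5SUMS` command, except that a manifest from a previous run is skipped.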
Using zsh:

$ ls
a.txt
b.txt
c.txt
Magic:

$ FILES=(**/*(.))
$ for f in $FILES; do hash=$(md5sum "$f" | cut -f1 -d" "); mv "$f" "${f:r}.$hash.${f:e}"; done
$ ls
a.60b725f10c9c85c70d97880dfe8191b3.txt
b.3b5d5c3712955042212316173ccf37be.txt
c.2cd6ee2c70b0bde53fbe6cac3c8b8bb1.txt
Happy deconstructing!


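Edit: now handles files in subdirectories, and the mv arguments are quoted.

A plain sh loop along the same lines, with the hash program named in $hashcommand:

find . -type f -print | while IFS= read -r file
do
    # $hashcommand must print only the hash; assumes every file has an extension
    hash=$($hashcommand "$file")
    filename=${file%.*}
    extension=${file##*.}
    mv "$file" "$filename.$hash.$extension"
done

In sh or bash, two versions. The first limits itself to files that have an extension:
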
hash () {
  #openssl md5 "$1" | sed -e 's/.* //'
  whirlpool "$1"
}

find . -type f -a -name '*.*' | while IFS= read -r f; do
  # remove the echo to run this for real
  echo mv "$f" "${f%.*}.whirlpool-$(hash "$f").${f##*.}"
done
Test:

...
mv ./bash-4.0/signames.h ./bash-4.0/signames.whirlpool-d71b117a822394a5b273ea6c0e3f4dc045b1098326d39864564f1046ab7bd9296d5533894626288265a1f70638ee3ecce1f6a22739b389ff7cb1fa48c76fa166.h
...
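This more complex version handles all regular files, with or without an extension, with or without spaces and odd characters in the name, and so on: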

hash () {
  #openssl md5 "$1" | sed -e 's/.* //'
  whirlpool "$1"
}

find . -type f | while IFS= read -r f; do
  name=${f##*/}
  case "$name" in
    *.*) extension=".${name##*.}" ;;
    *)   extension=   ;;
  esac
  # remove the echo to run this for real
  echo mv "$f" "${f%/*}/${name%.*}.whirlpool-$(hash "$f")$extension"
done
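  • Tested on files containing spaces, such as "a b"
  • Tested on files containing multiple extensions, such as "a.b.c"
  • Tested with directories containing spaces and/or dots
  • Tested on files without an extension inside directories containing dots, such as "a.b/c"
  • Update: now updates the hash if the file changes
Key points:

  • Uses find -print0 piped to while read -d $'\0' to handle spaces in file names correctly
  • md5sum can be replaced with your favorite hash function. The sed removes the first space and everything after it from the output of md5sum
  • The base file name is extracted with a regular expression that finds the last period that is not followed by another slash (so that periods in directory names are not counted as part of the extension)
  • The extension is found by taking a substring whose starting index is the length of the base file name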
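The last three key points can be sketched in Python. This is only an illustration of the described technique, not the answer's actual script: the names `split_path` and `digest_only` are hypothetical, and treating a leading dot as "hidden file, no extension" is an assumption on my part.

```python
import re

# A period with no slash and no further period after it, up to the end.
_LAST_DOT = re.compile(r"\.[^./]*$")


def split_path(path):
    """Split path into (base, extension); the extension keeps its dot."""
    match = _LAST_DOT.search(path)
    # A dot at the start of a component marks a hidden file, not an extension.
    if match is None or match.start() == 0 or path[match.start() - 1] == "/":
        return path, ""
    base = path[:match.start()]
    return base, path[len(base):]  # substring starting at len(base)


def digest_only(md5sum_line):
    """Drop the first space and everything after it, like the sed step."""
    return md5sum_line.split(" ", 1)[0]
```

With this, `split_path("a.b/c")` keeps the whole path as the base, because the only period is followed by a slash.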

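      Here is my take on it, in bash. Features: it skips non-regular files; correctly handles files with odd characters (such as spaces) in their names; handles file names without an extension; and skips files that have already been hashed, so it can be run repeatedly (although if a file is modified between runs, it adds a new hash rather than replacing the old one). I wrote it using md5 -q as the hash function; you should be able to replace that with anything else, as long as it outputs only the hash, not something like filename => hash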

      find -x . -type f -print0 | while IFS="" read -r -d $'\000' file; do
          hash="$(md5 -q "$file")" # replace with your favorite hash function
          [[ "$file" == *."$hash" ]] && continue # skip files that already end in their hash
          dirname="$(dirname "$file")"
          basename="$(basename "$file")"
          base="${basename%.*}"
          [[ "$base" == *."$hash" ]] && continue # skip files that already end in hash + extension
          if [[ "$basename" == "$base" ]]; then
                  extension=""
          else
                  extension=".${basename##*.}"
          fi
          mv "$file" "$dirname/$base.$hash$extension"
      done
      

      The logic of the requirements is complex enough to justify using Python instead of bash. It should give a more readable, extensible, and maintainable solution

      #!/usr/bin/env python
      import hashlib, os
      
      def ishash(h, size):
          """Whether `h` looks like hash's hex digest."""
          if len(h) != size:
              return False
          try:
              int(h, 16) # whether h is a hex number
              return True
          except ValueError:
              return False
      
      for root, dirs, files in os.walk("."):
          dirs[:] = [d for d in dirs if not d.startswith(".")] # skip hidden dirs
          for path in (os.path.join(root, f) for f in files if not f.startswith(".")):
              with open(path, "rb") as fh: # binary mode, so the digest matches md5sum
                  suffix = hash_ = "." + hashlib.md5(fh.read()).hexdigest()
              hashsize = len(hash_) - 1
              # extract old hash from the name; add/replace the hash if needed
              barepath, ext = os.path.splitext(path) # ext may be empty
              if not ishash(ext[1:], hashsize):
                  suffix += ext # add original extension
                  barepath, oldhash = os.path.splitext(barepath) 
                  if not ishash(oldhash[1:], hashsize):
                     suffix = oldhash + suffix # preserve 2nd (not a hash) extension
              else: # ext looks like a hash
                  oldhash = ext
              if hash_ != oldhash: # replace old hash by new one
                 os.rename(path, barepath+suffix)
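The script above reads each file into memory in one call, which may be a problem for very large files. A possible chunked variant is sketched below. The helper name `file_hexdigest` and the 1 MiB chunk size are my own choices, not part of the original answer.

```python
import hashlib


def file_hexdigest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in fixed-size binary chunks and return the hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            digest.update(chunk)
    return digest.hexdigest()
```

In the loop above, the hash could then be computed as `"." + file_hexdigest(path)`.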
      
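      Here is a test directory tree. It includes:

      • files without an extension inside directories with a dot in their name
      • file names that already contain a hash (idempotency test)
      • file names with two extensions
      • newlines in names
      To use the python-mhash module: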

      import mhash
      
      print mhash.MHASH(mhash.MHASH_WHIRLPOOL, "text to hash here").hexdigest()
      
      Output: CBDCA4520CC5C131FC3A86109DD23FEE2D7FF7BE5663D398180178378944A4F41480B938608AE98DA7ECCBF39A4C79B83A8590C4CB1BACE5BC638FC92B3E653


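      whirlpooldeep can also be called from Python (through subprocess).

      Existing hash-based tools can be leveraged for problems where sets of files need to be tracked by their hash values.

      whirlpool is not a very common hash. You will probably have to install a program to compute it. For example, Debian/Ubuntu include a "whirlpool" package. The program prints the hash of one file by itself. apt-cache search whirlpool shows that some other packages support it too, including the interesting md5deep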
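Whether Python itself can compute whirlpool depends on the OpenSSL build behind `hashlib`, so it is worth checking at run time before relying on it. A small sketch follows. The helper `pick_algorithm` is hypothetical, and falling back to sha512 (which also yields a 128-digit hex digest) is purely my assumption.

```python
import hashlib


def pick_algorithm(preferred="whirlpool", fallback="sha512"):
    """Return preferred if this Python/OpenSSL build can use it, else fallback."""
    if preferred in hashlib.algorithms_available:
        try:
            hashlib.new(preferred)
            return preferred
        except ValueError:
            pass  # listed, but disabled by the OpenSSL configuration
    return fallback


algorithm = pick_algorithm()
digest = hashlib.new(algorithm, b"text to hash here").hexdigest()
print(algorithm, len(digest))
```

If whirlpool is unavailable, the external programs mentioned above remain an option.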

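      Some of the earlier answers fail when file names contain spaces. If that is your case, but your file names do not contain any newlines, you can safely use \n as the delimiter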

      
      oldifs="$IFS"
      IFS="
      "
      for i in $(find -type f); do echo "$i";done
      #output
      # ./base
      # ./base2
      # ./normal.ext
      # ./trick.e "xt
      # ./foo bar.dir ext/trick' (name "- }$foo.ext{}.ext2
      IFS="$oldifs"
      
      Try it without setting IFS to see why it matters
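Newline splitting is safe only while no name contains a newline. The contrast with NUL-delimited output (as produced by `find -print0`) can be shown in a few lines of Python. Both function names are made up for this illustration.

```python
import os


def split_nul(data):
    """Split the bytes produced by `find -print0` into path strings."""
    return [os.fsdecode(item) for item in data.split(b"\0") if item]


def split_lines(data):
    """Newline splitting, for contrast: breaks names that contain a newline."""
    return [os.fsdecode(item) for item in data.split(b"\n") if item]
```

A name such as `f<newline>newline.ext` survives `split_nul` as one entry, while `split_lines` turns it into two.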

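      I was going to write something with IFS="."; find -print0 | while read -a array, to split on "." characters, but I normally never use array variables. I could not see an easy way in the man page to insert the hash as the second-to-last array element and push down the last element (the file extension, if there is one). Whenever bash array variables start to look interesting, I know it is time to do the job in perl instead! There are also gotchas when using read.

      I decided to use another technique I like: find -exec sh -c. It is the safest, since you are not parsing file names

      This should do the trick: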

      
      find -regextype posix-extended -type f -not -regex '.*\.[a-fA-F0-9]{128}.*'  \
      -execdir bash -c 'for i in "${@#./}";do 
       hash=$(whirlpool "$i");
       ext=".${i##*.}"; base="${i%.*}";
       [ "$base" = "$i" ] && ext="";
       newname="$base.$hash$ext";
       echo "ext:$ext  $i -> $newname";
       false mv --no-clobber "$i" "$newname";done' \
      dummy {} +
      # take out the "false" before the mv, and optionally take out the echo.
      # false ignores its arguments, so it's there so you can
      # run this to see what will happen without actually renaming your files.
      
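      -execdir bash -c 'cmd' dummy {} + has the dummy arg there because the first arg after the command becomes $0 in the shell's positional parameters, not part of the "$@" that the for loop iterates over. I use execdir instead of exec so that I do not have to deal with directory names (or the possibility of exceeding PATH_MAX for nested dirs with long names, when the actual file names are all short enough)

      -not -regex prevents this from being applied to the same file twice. Whirlpool is a very long hash, though, and mv says File name too long if I run it twice without that check. (On an XFS filesystem.)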
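The "already hashed" check and the length problem can both be illustrated in Python. This sketch is stricter than the find regex above (it requires a dot or the end of the name after the 128 hex digits), and the 255-byte limit per path component is an assumption that holds on common Linux filesystems. The names `already_hashed` and `fits` are hypothetical.

```python
import re

# A dot, exactly 128 hex digits, then a dot or the end of the name.
HASHED = re.compile(r"\.[0-9a-fA-F]{128}(?=\.|$)")
NAME_MAX = 255  # assumed limit in bytes for one path component


def already_hashed(name):
    """True if the file name already carries a 128-digit hex hash."""
    return HASHED.search(name) is not None


def fits(name, digest):
    """True if inserting '.digest' keeps the name within NAME_MAX bytes."""
    return len(name.encode("utf-8")) + 1 + len(digest) <= NAME_MAX
```

A name that already contains one whirlpool hash is at least 130 bytes long, so adding a second 128-digit hash cannot fit in 255 bytes, which matches the error described above.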

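      Files with no extension get basename.hash. I had to check for that specially, to avoid appending a trailing "." or taking the basename as the extension. ${@#./} strips the leading ./ that find puts in front of every file name, so for files with no extension there is no "." anywhere in the string

      mv --no-clobber may be a GNU extension. If you do not have GNU mv, do something else if you want to avoid deleting existing files (for example: you run this once, some of the same files are added to the directory again under their old names, and you run it again). OTOH, if you want that behaviour, just take it out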
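In a Python version the same concern applies, because `os.rename` silently replaces an existing destination on POSIX systems. One way to get no-clobber behaviour is sketched below. It assumes source and destination are on the same filesystem and that hard links are supported. The name `rename_noclobber` is made up.

```python
import os


def rename_noclobber(src, dst):
    """Rename src to dst, but refuse to replace an existing dst.

    Returns True if the rename happened, False if dst already existed.
    """
    try:
        os.link(src, dst)  # raises FileExistsError if dst exists
    except FileExistsError:
        return False
    os.unlink(src)  # dst now refers to the same file; drop the old name
    return True
```

Creating the hard link first makes the existence check and the creation of the new name a single step, so there is no window between "check" and "rename".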

      My solution should work even if file names contain newlines

      To install the python-mhash module on Debian/Ubuntu:

      $ sudo apt-get install python-mhash
      
      
      from subprocess import PIPE, STDOUT, Popen
      
      def getoutput(cmd):
          return Popen(cmd, stdout=PIPE, stderr=STDOUT).communicate()[0]
      
      hash_ = getoutput(["whirlpooldeep", "-q", path]).rstrip()
      
      
      
      #!/usr/bin/env ruby
      require 'digest/md5'
      
      Dir.glob('**/*') do |f|
        next unless File.file? f
        next if /\.md5sum-[0-9a-f]{32}/ =~ f
        md5sum = Digest::MD5.file f
        newname = "%s/%s.md5sum-%s%s" %
          [File.dirname(f), File.basename(f,'.*'), md5sum, File.extname(f)]
        File.rename f, newname
      end
      
      #!/usr/bin/env bash
      # Tested with:
      # GNU bash, version 4.0.28(1)-release (x86_64-pc-linux-gnu)
      # ksh (AT&T Research) 93s+ 2008-01-31
      # mksh @(#)MIRBSD KSH R39 2009/08/01 Debian 39.1-4
      # Does not work with pdksh, dash

      DEFAULT_SUM="md5"

      # Takes a parameter, the root path,
      # as well as an optional parameter, the hash function to use (md5 or wp for whirlpool).
      main() {
        case $2 in
          "wp")  export SUM="wp" ;;
          "md5") export SUM="md5" ;;
          *)     export SUM=$DEFAULT_SUM ;;
        esac

        # For all visible files in all visible subfolders, move the file
        # to a name including the correct hash:
        find "$1" -type f -not -regex '.*/\..*' -exec "$0" hashmove '{}' \;
      }

      # Given a file named in $1 with full path, calculate its hash.
      # Output the file name, with the hash inserted before the extension
      # (if any) -- or: replace an existing hash with the new one,
      # if a hash already exists.
      hashname_md5() {
        pathname="$1"
        full_hash=`md5sum "$pathname"`
        hash=${full_hash:0:32}
        filename=`basename "$pathname"`
        prefix=${filename%%.*}
        suffix=${filename#$prefix}

        # If the suffix starts with something that looks like an md5sum,
        # remove it:
        suffix=`echo "$suffix"|sed -r 's/\.[a-z0-9]{32}//'`

        echo "$prefix.$hash$suffix"
      }

      # Same as hashname_md5 -- but uses whirlpool hash.
      hashname_wp() {
        pathname="$1"
        hash=`whirlpool "$pathname"`
        filename=`basename "$pathname"`
        prefix=${filename%%.*}
        suffix=${filename#$prefix}

        # If the suffix starts with something that looks like a whirlpool hash,
        # remove it:
        suffix=`echo "$suffix"|sed -r 's/\.[a-z0-9]{128}//'`

        echo "$prefix.$hash$suffix"
      }

      # Given a filepath $1, move/rename it to a name including the file hash.
      # Try to replace an existing hash, and do not move a file if no update is
      # needed.
      hashmove() {
        pathname="$1"
        filename=`basename "$pathname"`
        path="${pathname%%/$filename}"

        case $SUM in
          "wp")  hashname=`hashname_wp "$pathname"` ;;
          "md5") hashname=`hashname_md5 "$pathname"` ;;
          *)
            echo "Unknown hash requested"
            exit 1
            ;;
        esac

        if [[ "$filename" != "$hashname" ]]
        then
          echo "renaming: $pathname => $path/$hashname"
          mv "$pathname" "$path/$hashname"
        else
          echo "$pathname up to date"
        fi
      }

      # Create some test data under /tmp
      mktest() {
        root_dir=$(tempfile)
        rm "$root_dir"
        mkdir "$root_dir"

        # Entries come in pairs: file name, then file content
        i=0
        test_files[$((i++))]='test'
        test_files[$((i++))]='testfile, no extention or spaces'

        test_files[$((i++))]='.hidden'
        test_files[$((i++))]='a hidden file'

        test_files[$((i++))]='test space'
        test_files[$((i++))]='testfile, no extention, spaces in name'

        test_files[$((i++))]='test.txt'
        test_files[$((i++))]='testfile, extention, no spaces in name'

        test_files[$((i++))]='test.ab8e460eac3599549cfaa23a848635aa.txt'
        test_files[$((i++))]='testfile, With (wrong) md5sum, no spaces in name'

        test_files[$((i++))]='test spaced.ab8e460eac3599549cfaa23a848635aa.txt'
        test_files[$((i++))]='testfile, With (wrong) md5sum, spaces in name'

        test_files[$((i++))]='test.8072ec03e95a26bb07d6e163c93593283fee032db7265a29e2430004eefda22ce096be3fa189e8988c6ad77a3154af76f582d7e84e3f319b798d369352a63c3d.txt'
        test_files[$((i++))]='testfile, With (wrong) whirlpoolhash, no spaces in name'

        test_files[$((i++))]='test spaced.8072ec03e95a26bb07d6e163c93593283fee032db7265a29e2430004eefda22ce096be3fa189e8988c6ad77a3154af76f582d7e84e3f319b798d369352a63c3d.txt'
        test_files[$((i++))]='testfile, With (wrong) whirlpoolhash, spaces in name'

        test_files[$((i++))]='test space.txt'
        test_files[$((i++))]='testfile, extention, spaces in name'

        test_files[$((i++))]='test multi-space .txt'
        test_files[$((i++))]='testfile, extention, multiple consequtive spaces in name'

        test_files[$((i++))]='test space.h'
        test_files[$((i++))]='testfile, short extention, spaces in name'

        test_files[$((i++))]='test space.reallylong'
        test_files[$((i++))]='testfile, long extention, spaces in name'

        test_files[$((i++))]='test space.reallyreallyreallylong.tst'
        test_files[$((i++))]='testfile, long extention, double extention, might look like hash, spaces in name'

        test_files[$((i++))]='utf8test1 - æeiaæå.txt'
        test_files[$((i++))]='testfile, extention, utf8 characters, spaces in name'

        test_files[$((i++))]='utf8test1 - 漢字.txt'
        test_files[$((i++))]='testfile, extention, Japanese utf8 characters, spaces in name'

        for s in . sub1 sub2 sub1/sub3 .hidden_dir
        do
          # note: -p not needed as we create dirs top-down
          # fails for "." -- but the hack allows us to use a single loop
          # for creating test data in all dirs
          mkdir $root_dir/$s
          dir=$root_dir/$s

          i=0
          while [[ $i -lt ${#test_files[*]} ]]
          do
            filename=${test_files[$((i++))]}
            echo ${test_files[$((i++))]} > "$dir/$filename"
          done
        done

        echo "$root_dir"
      }

      # Run test, given a hash type as first argument
      runtest() {
        sum=$1

        root_dir=$(mktest)

        echo "created dir: $root_dir"
        echo "Running first test with hashtype $sum:"
        echo
        main $root_dir $sum
        echo
        echo "Running second test:"
        echo
        main $root_dir $sum
        echo "Updating all files:"

        find $root_dir -type f | while read f
        do
          echo "more content" >> "$f"
        done

        echo
        echo "Running final test:"
        echo
        main $root_dir $sum
        # cleanup:
        rm -r $root_dir
      }

      # Test md5 and whirlpool hashes on generated data.
      runtests() {
        runtest md5
        runtest wp
      }

      # In order to be able to call the script recursively, without splitting off
      # functions to separate files:
      case "$1" in
        'test')     runtests ;;
        'hashname') "hashname_${SUM:-$DEFAULT_SUM}" "$2" ;;
        'hashmove') hashmove "$2" ;;
        'run')      main "$2" "$3" ;;
        *)
          echo "Use with: $0 test - or if you just want to try it on a folder:"
          echo " $0 run path (implies md5)"
          echo " $0 run path md5"
          echo " $0 run path wp"
          ;;
      esac
      
      #!/usr/bin/perl -w
      # whirlpool-rename.pl
      # 2009 Peter Cordes <peter@cordes.ca>.  Share and Enjoy!
      
      use Fcntl;      # for O_BINARY
      use File::Find;
      use Digest::Whirlpool;
      
      # find callback, called once per directory entry
      # $_ is the base name of the file, and we are chdired to that directory.
      sub whirlpool_rename {
          print "find: $_\n";
      #    my @components = split /\.(?:[[:xdigit:]]{128})?/; # remove .hash while we're at it
          my @components = split /\.(?!\.|$)/, $_, -1; # -1 to not leave out trailing dots
      
          if (!$components[0] && $_ ne ".") { # hidden file/directory
              $File::Find::prune = 1;
              return;
          }
      
          # don't follow symlinks or process non-regular-files
          return if (-l $_ || ! -f _);
      
          my $digest;
          eval {
              sysopen(my $fh, $_, O_RDONLY | O_BINARY) or die "$!";
              $digest = Digest->new( 'Whirlpool' )->addfile($fh);
          };
          if ($@) {  # exception-catching structure from whirlpoolsum, distributed with Digest::Whirlpool.
              warn "whirlpool: couldn't hash $_: $!\n";
              return;
          }
      
          # strip old hashes from the name.  not done during split only in the interests of readability
          @components = grep { !/^[[:xdigit:]]{128}$/ }  @components;
          if ($#components == 0) {
              push @components, $digest->hexdigest;
          } else {
              my $ext = pop @components;
              push @components, $digest->hexdigest, $ext;
          }
      
          my $newname = join('.', @components);
          return if $_ eq $newname;
          print "rename  $_ ->  $newname\n";
          if (-e $newname) {
              warn "whirlpool: clobbering $newname\n";
              # maybe unlink $_ and return if $_ is older than $newname?
              # But you'd better check that $newname has the right contents then...
          }
          # This could be link instead of rename, but then you'd have to handle directories, and you can't make hardlinks across filesystems
          rename $_, $newname or warn "whirlpool: couldn't rename $_ -> $newname:  $!\n";
      }
      
      
      #main
      $ARGV[0] = "." if !@ARGV;  # default to current directory
      find({wanted => \&whirlpool_rename, no_chdir => 0}, @ARGV );
      
      find -name '.?*' -prune -o \( -type f -print0 \)
      
      #!/bin/bash
      if (($# != 1)) || ! [[ -d "$1" ]]; then
          echo "Usage: $0 /path/to/directory"
          exit 1
      fi
      
      is_hash() {
       md5=${1##*.} # strip prefix
       [[ "$md5" == *[^[:xdigit:]]* || ${#md5} -lt 32 ]] && echo "$1" || echo "${1%.*}"
      }
      
      while IFS= read -r -d $'\0' file; do
          read hash junk < <(md5sum "$file")
          basename="${file##*/}"
          dirname="${file%/*}"
          pre_ext="${basename%.*}"
          ext="${basename:${#pre_ext}}"
      
          # File already hashed?
          pre_ext=$(is_hash "$pre_ext")
          ext=$(is_hash "$ext")
      
          mv "$file" "${dirname}/${pre_ext}.${hash}${ext}" 2> /dev/null
      
      done < <(find "$1" -path "*/.*" -prune -o \( -type f -print0 \))
      
      $ tree -a a
      a
      |-- .hidden_dir
      |   `-- foo
      |-- b
      |   `-- c.d
      |       |-- f
      |       |-- g.5236b1ab46088005ed3554940390c8a7.ext
      |       |-- h.d41d8cd98f00b204e9800998ecf8427e
      |       |-- i.ext1.5236b1ab46088005ed3554940390c8a7.ext2
      |       `-- j.ext1.ext2
      |-- c.ext^Mnewline
      |   |-- f
      |   `-- g.with[or].ext
      `-- f^Jnewline.ext

      4 directories, 9 files

      $ tree -a a
      a
      |-- .hidden_dir
      |   `-- foo
      |-- b
      |   `-- c.d
      |       |-- f.d41d8cd98f00b204e9800998ecf8427e
      |       |-- g.d41d8cd98f00b204e9800998ecf8427e.ext
      |       |-- h.d41d8cd98f00b204e9800998ecf8427e
      |       |-- i.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
      |       `-- j.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
      |-- c.ext^Mnewline
      |   |-- f.d41d8cd98f00b204e9800998ecf8427e
      |   `-- g.with[or].d41d8cd98f00b204e9800998ecf8427e.ext
      `-- f^Jnewline.d3b07384d113edec49eaa6238ad5ff00.ext

      4 directories, 9 files
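A tree like the one in the listing above can be recreated for testing with a short Python sketch. The file contents are inferred from the digests in the second listing (d41d8cd9... is the MD5 of an empty file, d3b07384... is the MD5 of "foo" plus a newline). The content of .hidden_dir/foo is not shown anywhere, so leaving it empty is an assumption. The name `make_tree` is made up.

```python
import os

OLD_HASH = "5236b1ab46088005ed3554940390c8a7"   # stale hash from the listing
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of zero bytes

# Relative path -> content. ^M is a carriage return, ^J is a newline.
FILES = {
    ".hidden_dir/foo": b"",  # content unknown; assumed empty
    "b/c.d/f": b"",
    "b/c.d/g.%s.ext" % OLD_HASH: b"",
    "b/c.d/h.%s" % EMPTY_MD5: b"",
    "b/c.d/i.ext1.%s.ext2" % OLD_HASH: b"",
    "b/c.d/j.ext1.ext2": b"",
    "c.ext\rnewline/f": b"",
    "c.ext\rnewline/g.with[or].ext": b"",
    "f\nnewline.ext": b"foo\n",
}


def make_tree(root):
    """Create the test files under root and return their paths."""
    created = []
    for rel, content in FILES.items():
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(content)
        created.append(path)
    return created
```

Running a renaming script against a fresh copy (for example one created in a directory from `tempfile.mkdtemp()`) and then running it a second time gives a quick idempotency check.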