Python 散列多个文件 问题说明:
给定一个目录,我想遍历该目录及其非隐藏子目录,Python 散列多个文件 问题说明:,python,perl,bash,hash,batch-processing,Python,Perl,Bash,Hash,Batch Processing,给定一个目录,我想遍历该目录及其非隐藏子目录, 并在非隐藏的 文件名。 如果重新运行脚本,它将用新的哈希替换旧的哈希 ==>。 。==>。 问题: a) 你会怎么做? b) 在所有可用的方法中,什么使您的方法最合适? 判决: 谢谢大家,我选择SeigeX的答案是因为它的速度和便携性。 它比其他bash变体更快, 它在我的MacOSX机器上运行,没有任何改动 您可能希望将结果存储在一个文件中,如 find . -type f -exec md5sum {} \; > MD5SUMS 如
并在非隐藏的 文件名。
如果重新运行脚本,它将用新的哈希替换旧的哈希
==>。
。
==>。
问题: a) 你会怎么做? b) 在所有可用的方法中,什么使您的方法最合适?
判决: 谢谢大家,我选择SeigeX的答案是因为它的速度和便携性。
它比其他bash变体更快,
它在我的MacOSX机器上运行,没有任何改动
您可能希望将结果存储在一个文件中,如
find . -type f -exec md5sum {} \; > MD5SUMS
如果确实希望每个散列一个文件:
find . -type f | while read f; do g=`md5sum $f` > $f.md5; done
甚至
find . -type f | while read f; do g=`md5sum $f | awk '{print $1}'`; echo "$g $f"> $f-$g.md5; done
使用zsh:
$ ls
a.txt
b.txt
c.txt
魔法:
$ FILES=**/*(.)
$ # */ stupid syntax coloring thinks this is a comment
$ for f in $FILES; do hash=`md5sum $f | cut -f1 -d" "`; mv $f "$f:r.$hash.$f:e"; done
$ ls
a.60b725f10c9c85c70d97880dfe8191b3.txt
b.3b5d5c3712955042212316173ccf37be.txt
c.2cd6ee2c70b0bde53fbe6cac3c8b8bb1.txt
快乐解构
编辑:在sh或bash的两个版本中,在子目录和
mv
argument周围添加文件。一种是将自身限制为具有扩展名的文件
find . -type f -print | while read file
do
hash=`$hashcommand "$file"`
filename=${file%.*}
extension=${file##*.}
mv $file "$filename.$hash.$extension"
done
hash () {
#openssl md5 t.sh | sed -e 's/.* //'
whirlpool "$f"
}
find . -type f -a -name '*.*' | while read f; do
# remove the echo to run this for real
echo mv "$f" "${f%.*}.whirlpool-`hash "$f"`.${f##*.}"
done
测试
...
mv ./bash-4.0/signames.h ./bash-4.0/signames.whirlpool-d71b117a822394a5b273ea6c0e3f4dc045b1098326d39864564f1046ab7bd9296d5533894626288265a1f70638ee3ecce1f6a22739b389ff7cb1fa48c76fa166.h
...
这个更复杂的版本处理所有普通文件,有或没有扩展名,有或没有空格和奇数字符,等等
hash () {
#openssl md5 t.sh | sed -e 's/.* //'
whirlpool "$f"
}
find . -type f | while read f; do
name=${f##*/}
case "$name" in
*.*) extension=".${name##*.}" ;;
*) extension= ;;
esac
# remove the echo to run this for real
echo mv "$f" "${f%/*}/${name%.*}.whirlpool-`hash "$f"`$extension"
done
- 在包含空格(如“a b”)的文件上测试
- 在包含多个扩展名(如“a.b.c”)的文件上进行测试
- 使用包含空格和/或点的目录进行测试
- 在包含点的目录(如“a.b/c”)中不包含扩展名的文件上进行测试
- 更新:如果文件更改,现在更新哈希
- 在读取-d$'\0'时,使用管道传输到
的
来正确处理文件名中的空格print0
- md5sum可以替换为您最喜欢的哈希函数。sed从md5sum的输出中删除第一个空格及其后面的所有内容
- 基本文件名是使用一个正则表达式提取的,该正则表达式查找最后一个没有后跟另一个斜杠的句点(这样,目录名中的句点就不会被计算为扩展名的一部分)
- 扩展名是通过使用起始索引为基本文件名长度的子字符串找到的
- 以下是我在bash中对它的看法。功能:跳过非常规文件;正确处理名称中带有奇怪字符(即空格)的文件;处理无扩展文件名;跳过已经散列的文件,以便可以重复运行(尽管如果在运行之间修改了文件,则会添加新的散列,而不是替换旧的散列)。我使用md5-q作为散列函数编写了它;您应该能够用任何其他内容替换它,只要它只输出散列,而不是像filename=>hash这样的内容
find -x . -type f -print0 | while IFS="" read -r -d $'\000' file; do
hash="$(md5 -q "$file")" # replace with your favorite hash function
[[ "$file" == *."$hash" ]] && continue # skip files that already end in their hash
dirname="$(dirname "$file")"
basename="$(basename "$file")"
base="${basename%.*}"
[[ "$base" == *."$hash" ]] && continue # skip files that already end in hash + extension
if [[ "$basename" == "$base" ]]; then
extension=""
else
extension=".${basename##*.}"
fi
mv "$file" "$dirname/$base.$hash$extension"
done
需求的逻辑非常复杂,足以证明使用Python而不是bash是合理的。它应该提供更具可读性、可扩展性和可维护性的解决方案
#!/usr/bin/env python
import hashlib, os
def ishash(h, size):
"""Whether `h` looks like hash's hex digest."""
if len(h) == size:
try:
int(h, 16) # whether h is a hex number
return True
except ValueError:
return False
for root, dirs, files in os.walk("."):
dirs[:] = [d for d in dirs if not d.startswith(".")] # skip hidden dirs
for path in (os.path.join(root, f) for f in files if not f.startswith(".")):
suffix = hash_ = "." + hashlib.md5(open(path).read()).hexdigest()
hashsize = len(hash_) - 1
# extract old hash from the name; add/replace the hash if needed
barepath, ext = os.path.splitext(path) # ext may be empty
if not ishash(ext[1:], hashsize):
suffix += ext # add original extension
barepath, oldhash = os.path.splitext(barepath)
if not ishash(oldhash[1:], hashsize):
suffix = oldhash + suffix # preserve 2nd (not a hash) extension
else: # ext looks like a hash
oldhash = ext
if hash_ != oldhash: # replace old hash by new one
os.rename(path, barepath+suffix)
这是一个测试目录树。它包括:
- 名称中带有点的目录中没有扩展名的文件
- 已包含哈希的文件名(幂等性测试)
- 带有两个扩展名的文件名
- 名称中的换行符
import mhash
print mhash.MHASH(mhash.MHASH_WHIRLPOOL, "text to hash here").hexdigest()
输出:
CBDCA4520CC5C131FC3A86109DD23FEE2D7FF7BE5663D398180178378944A4F41480B938608AE98DA7ECCBF39A4C79B83A8590C4CB1BACE5BC638FC92B3E653
在Python中调用whirlpooldeep
可以为需要基于哈希值跟踪文件集的问题提供利用 whirlpool不是很常见的杂烩。您可能需要安装一个程序来计算它。e、 Debian/Ubuntu包含一个“惠而浦”软件包。程序自己打印一个文件的散列。apt cache search whirlpool显示,其他一些软件包也支持它,包括有趣的md5deep 一些早期的Anwser在文件名中包含空格时会失败。如果是这种情况,但文件名中没有任何换行符,则可以安全地使用\n作为分隔符
oldifs="$IFS"
IFS="
"
for i in $(find -type f); do echo "$i";done
#output
# ./base
# ./base2
# ./normal.ext
# ./trick.e "xt
# ./foo bar.dir ext/trick' (name "- }$foo.ext{}.ext2
IFS="$oldifs"
尝试不设置IFS,看看它为什么重要
我打算用IFS=“”;读取数组时查找-print0 |,以拆分“.”字符,但我通常从不使用数组变量。我在手册页中看到,插入散列作为第二个最后一个数组索引,并向下推最后一个元素(文件扩展名,如果有的话)并不是一个简单的方法。每当bash数组变量看起来有趣时,我知道是时候用perl来做我正在做的事情了!请参阅有关使用读取的gotchas:
我决定使用另一种我喜欢的技术:find-exec sh-c。这是最安全的,因为您没有解析文件名
这应该可以做到:
find -regextype posix-extended -type f -not -regex '.*\.[a-fA-F0-9]{128}.*' \
-execdir bash -c 'for i in "${@#./}";do
hash=$(whirlpool "$i");
ext=".${i##*.}"; base="${i%.*}";
[ "$base" = "$i" ] && ext="";
newname="$base.$hash$ext";
echo "ext:$ext $i -> $newname";
false mv --no-clobber "$i" "$newname";done' \
dummy {} +
# take out the "false" before the mv, and optionally take out the echo.
# false ignores its arguments, so it's there so you can
# run this to see what will happen without actually renaming your files.
-execdir bash-c'cmd'dummy{}+在那里有一个dummy arg,因为命令后面的第一个arg在shell的位置参数中变成$0,而不是循环中的“$@”的一部分。我使用execdir而不是exec,因此我不必处理目录名(或者当实际文件名都足够短时,对于具有长名称的嵌套dir,可能会超过PATH_MAX)
-not-regex防止将此应用于同一文件两次。虽然whirlpool是一个非常长的散列,mv说如果我在没有检查的情况下运行两次,文件名就太长了。(在XFS文件系统上。)
没有扩展名的文件获取basename.hash。我必须特别检查,以避免添加尾随符号,或将basename作为扩展名${@#./}去掉了每个文件名前面的前导字符./that find put,因此对于没有扩展名的文件,整个字符串中没有“.”
mv——任何clobber都不能是GNU扩展。如果您没有GNU mv,如果您希望避免删除现有文件,请执行其他操作(例如,您只运行一次,同一文件中的某些文件将以旧名称添加到目录中;您将再次运行该文件。)OTOH,如果您想要这种行为,请将其删除
即使文件名包含换行符,我的解决方案也应该有效
$ sudo apt-get install python-mhash
import mhash
print mhash.MHASH(mhash.MHASH_WHIRLPOOL, "text to hash here").hexdigest()
from subprocess import PIPE, STDOUT, Popen
def getoutput(cmd):
return Popen(cmd, stdout=PIPE, stderr=STDOUT).communicate()[0]
hash_ = getoutput(["whirlpooldeep", "-q", path]).rstrip()
oldifs="$IFS"
IFS="
"
for i in $(find -type f); do echo "$i";done
#output
# ./base
# ./base2
# ./normal.ext
# ./trick.e "xt
# ./foo bar.dir ext/trick' (name "- }$foo.ext{}.ext2
IFS="$oldifs"
find -regextype posix-extended -type f -not -regex '.*\.[a-fA-F0-9]{128}.*' \
-execdir bash -c 'for i in "${@#./}";do
hash=$(whirlpool "$i");
ext=".${i##*.}"; base="${i%.*}";
[ "$base" = "$i" ] && ext="";
newname="$base.$hash$ext";
echo "ext:$ext $i -> $newname";
false mv --no-clobber "$i" "$newname";done' \
dummy {} +
# take out the "false" before the mv, and optionally take out the echo.
# false ignores its arguments, so it's there so you can
# run this to see what will happen without actually renaming your files.
#!/usr/bin/env ruby
require 'digest/md5'
Dir.glob('**/*') do |f|
next unless File.file? f
next if /\.md5sum-[0-9a-f]{32}/ =~ f
md5sum = Digest::MD5.file f
newname = "%s/%s.md5sum-%s%s" %
[File.dirname(f), File.basename(f,'.*'), md5sum, File.extname(f)]
File.rename f, newname
end
#!/usr/bin/env bash
#Tested with:
# GNU bash, version 4.0.28(1)-release (x86_64-pc-linux-gnu)
# ksh (AT&T Research) 93s+ 2008-01-31
# mksh @(#)MIRBSD KSH R39 2009/08/01 Debian 39.1-4
# Does not work with pdksh, dash
DEFAULT_SUM="md5"
#Takes a parameter, as root path
# as well as an optional parameter, the hash function to use (md5 or wp for whirlpool).
main()
{
case $2 in
"wp")
export SUM="wp"
;;
"md5")
export SUM="md5"
;;
*)
export SUM=$DEFAULT_SUM
;;
esac
# For all visible files in all visible subfolders, move the file
# to a name including the correct hash:
find $1 -type f -not -regex '.*/\..*' -exec $0 hashmove '{}' \;
}
# Given a file named in $1 with full path, calculate it's hash.
# Output the filname, with the hash inserted before the extention
# (if any) -- or: replace an existing hash with the new one,
# if a hash already exist.
hashname_md5()
{
pathname="$1"
full_hash=`md5sum "$pathname"`
hash=${full_hash:0:32}
filename=`basename "$pathname"`
prefix=${filename%%.*}
suffix=${filename#$prefix}
#If the suffix starts with something that looks like an md5sum,
#remove it:
suffix=`echo $suffix|sed -r 's/\.[a-z0-9]{32}//'`
echo "$prefix.$hash$suffix"
}
# Same as hashname_md5 -- but uses whirlpool hash.
hashname_wp()
{
pathname="$1"
hash=`whirlpool "$pathname"`
filename=`basename "$pathname"`
prefix=${filename%%.*}
suffix=${filename#$prefix}
#If the suffix starts with something that looks like an md5sum,
#remove it:
suffix=`echo $suffix|sed -r 's/\.[a-z0-9]{128}//'`
echo "$prefix.$hash$suffix"
}
#Given a filepath $1, move/rename it to a name including the filehash.
# Try to replace an existing hash, an not move a file if no update is
# needed.
hashmove()
{
pathname="$1"
filename=`basename "$pathname"`
path="${pathname%%/$filename}"
case $SUM in
"wp")
hashname=`hashname_wp "$pathname"`
;;
"md5")
hashname=`hashname_md5 "$pathname"`
;;
*)
echo "Unknown hash requested"
exit 1
;;
esac
if [[ "$filename" != "$hashname" ]]
then
echo "renaming: $pathname => $path/$hashname"
mv "$pathname" "$path/$hashname"
else
echo "$pathname up to date"
fi
}
# Create som testdata under /tmp
mktest()
{
root_dir=$(tempfile)
rm "$root_dir"
mkdir "$root_dir"
i=0
test_files[$((i++))]='test'
test_files[$((i++))]='testfile, no extention or spaces'
test_files[$((i++))]='.hidden'
test_files[$((i++))]='a hidden file'
test_files[$((i++))]='test space'
test_files[$((i++))]='testfile, no extention, spaces in name'
test_files[$((i++))]='test.txt'
test_files[$((i++))]='testfile, extention, no spaces in name'
test_files[$((i++))]='test.ab8e460eac3599549cfaa23a848635aa.txt'
test_files[$((i++))]='testfile, With (wrong) md5sum, no spaces in name'
test_files[$((i++))]='test spaced.ab8e460eac3599549cfaa23a848635aa.txt'
test_files[$((i++))]='testfile, With (wrong) md5sum, spaces in name'
test_files[$((i++))]='test.8072ec03e95a26bb07d6e163c93593283fee032db7265a29e2430004eefda22ce096be3fa189e8988c6ad77a3154af76f582d7e84e3f319b798d369352a63c3d.txt'
test_files[$((i++))]='testfile, With (wrong) whirlpoolhash, no spaces in name'
test_files[$((i++))]='test spaced.8072ec03e95a26bb07d6e163c93593283fee032db7265a29e2430004eefda22ce096be3fa189e8988c6ad77a3154af76f582d7e84e3f319b798d369352a63c3d.txt']
test_files[$((i++))]='testfile, With (wrong) whirlpoolhash, spaces in name'
test_files[$((i++))]='test space.txt'
test_files[$((i++))]='testfile, extention, spaces in name'
test_files[$((i++))]='test multi-space .txt'
test_files[$((i++))]='testfile, extention, multiple consequtive spaces in name'
test_files[$((i++))]='test space.h'
test_files[$((i++))]='testfile, short extention, spaces in name'
test_files[$((i++))]='test space.reallylong'
test_files[$((i++))]='testfile, long extention, spaces in name'
test_files[$((i++))]='test space.reallyreallyreallylong.tst'
test_files[$((i++))]='testfile, long extention, double extention,
might look like hash, spaces in name'
test_files[$((i++))]='utf8test1 - æeiaæå.txt'
test_files[$((i++))]='testfile, extention, utf8 characters, spaces in name'
test_files[$((i++))]='utf8test1 - 漢字.txt'
test_files[$((i++))]='testfile, extention, Japanese utf8 characters, spaces in name'
for s in . sub1 sub2 sub1/sub3 .hidden_dir
do
#note -p not needed as we create dirs top-down
#fails for "." -- but the hack allows us to use a single loop
#for creating testdata in all dirs
mkdir $root_dir/$s
dir=$root_dir/$s
i=0
while [[ $i -lt ${#test_files[*]} ]]
do
filename=${test_files[$((i++))]}
echo ${test_files[$((i++))]} > "$dir/$filename"
done
done
echo "$root_dir"
}
# Run test, given a hash-type as first argument
runtest()
{
sum=$1
root_dir=$(mktest)
echo "created dir: $root_dir"
echo "Running first test with hashtype $sum:"
echo
main $root_dir $sum
echo
echo "Running second test:"
echo
main $root_dir $sum
echo "Updating all files:"
find $root_dir -type f | while read f
do
echo "more content" >> "$f"
done
echo
echo "Running final test:"
echo
main $root_dir $sum
#cleanup:
rm -r $root_dir
}
# Test md5 and whirlpool hashes on generated data.
runtests()
{
runtest md5
runtest wp
}
#For in order to be able to call the script recursively, without splitting off
# functions to separate files:
case "$1" in
'test')
runtests
;;
'hashname')
hashname "$2"
;;
'hashmove')
hashmove "$2"
;;
'run')
main "$2" "$3"
;;
*)
echo "Use with: $0 test - or if you just want to try it on a folder:"
echo " $0 run path (implies md5)"
echo " $0 run md5 path"
echo " $0 run wp path"
;;
esac
#!/usr/bin/perl -w
# whirlpool-rename.pl
# 2009 Peter Cordes <peter@cordes.ca>. Share and Enjoy!
use Fcntl; # for O_BINARY
use File::Find;
use Digest::Whirlpool;
# find callback, called once per directory entry
# $_ is the base name of the file, and we are chdired to that directory.
sub whirlpool_rename {
print "find: $_\n";
# my @components = split /\.(?:[[:xdigit:]]{128})?/; # remove .hash while we're at it
my @components = split /\.(?!\.|$)/, $_, -1; # -1 to not leave out trailing dots
if (!$components[0] && $_ ne ".") { # hidden file/directory
$File::Find::prune = 1;
return;
}
# don't follow symlinks or process non-regular-files
return if (-l $_ || ! -f _);
my $digest;
eval {
sysopen(my $fh, $_, O_RDONLY | O_BINARY) or die "$!";
$digest = Digest->new( 'Whirlpool' )->addfile($fh);
};
if ($@) { # exception-catching structure from whirlpoolsum, distributed with Digest::Whirlpool.
warn "whirlpool: couldn't hash $_: $!\n";
return;
}
# strip old hashes from the name. not done during split only in the interests of readability
@components = grep { !/^[[:xdigit:]]{128}$/ } @components;
if ($#components == 0) {
push @components, $digest->hexdigest;
} else {
my $ext = pop @components;
push @components, $digest->hexdigest, $ext;
}
my $newname = join('.', @components);
return if $_ eq $newname;
print "rename $_ -> $newname\n";
if (-e $newname) {
warn "whirlpool: clobbering $newname\n";
# maybe unlink $_ and return if $_ is older than $newname?
# But you'd better check that $newname has the right contents then...
}
# This could be link instead of rename, but then you'd have to handle directories, and you can't make hardlinks across filesystems
rename $_, $newname or warn "whirlpool: couldn't rename $_ -> $newname: $!\n";
}
#main
$ARGV[0] = "." if !@ARGV; # default to current directory
find({wanted => \&whirlpool_rename, no_chdir => 0}, @ARGV );
find -name '.?*' -prune -o \( -type f -print0 \)
#!/bin/bash
if (($# != 1)) || ! [[ -d "$1" ]]; then
echo "Usage: $0 /path/to/directory"
exit 1
fi
is_hash() {
md5=${1##*.} # strip prefix
[[ "$md5" == *[^[:xdigit:]]* || ${#md5} -lt 32 ]] && echo "$1" || echo "${1%.*}"
}
while IFS= read -r -d $'\0' file; do
read hash junk < <(md5sum "$file")
basename="${file##*/}"
dirname="${file%/*}"
pre_ext="${basename%.*}"
ext="${basename:${#pre_ext}}"
# File already hashed?
pre_ext=$(is_hash "$pre_ext")
ext=$(is_hash "$ext")
mv "$file" "${dirname}/${pre_ext}.${hash}${ext}" 2> /dev/null
done < <(find "$1" -path "*/.*" -prune -o \( -type f -print0 \))
$ tree -a a
a
|-- .hidden_dir
| `-- foo
|-- b
| `-- c.d
| |-- f
| |-- g.5236b1ab46088005ed3554940390c8a7.ext
| |-- h.d41d8cd98f00b204e9800998ecf8427e
| |-- i.ext1.5236b1ab46088005ed3554940390c8a7.ext2
| `-- j.ext1.ext2
|-- c.ext^Mnewline
| |-- f
| `-- g.with[or].ext
`-- f^Jnewline.ext
4 directories, 9 files
$ tree -a a
a
|-- .hidden_dir
| `-- foo
|-- b
| `-- c.d
| |-- f.d41d8cd98f00b204e9800998ecf8427e
| |-- g.d41d8cd98f00b204e9800998ecf8427e.ext
| |-- h.d41d8cd98f00b204e9800998ecf8427e
| |-- i.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
| `-- j.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
|-- c.ext^Mnewline
| |-- f.d41d8cd98f00b204e9800998ecf8427e
| `-- g.with[or].d41d8cd98f00b204e9800998ecf8427e.ext
`-- f^Jnewline.d3b07384d113edec49eaa6238ad5ff00.ext
4 directories, 9 files