"find -regex ..." in Python, or how to find files whose full name (path + name) matches a regular expression?
I would like to find files whose full name (relative, though absolute is fine too) matches a given regular expression (that is, something like the glob module, but with regex matching instead of shell wildcard matching). With find, one can do, for example:
find -regex ./foo/\w+/bar/[0-9]+-\w+.dat
Of course, I could use find via os.system(...) or os.exec*(...), but I am looking for a pure Python solution. A simple one combines os.walk(...) with re module regular expressions. (It is not robust and misses many (less common) corner cases, but it is good enough for my one-off purpose of locating specific data files for a single insertion into a database.)
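For illustration, a naive os.walk() plus re approach could look like the sketch below. This is not the asker's original code: the function name find_regex_naive and the convention of matching against a "./"-prefixed relative path (as find prints it) are assumptions made for this example.

```python
import os
import re

def find_regex_naive(pattern, top="."):
    """Yield './'-prefixed relative paths of files under `top` that fully match `pattern`.

    Every directory is visited; nothing is pruned.
    """
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), top)
            # Normalise to forward slashes so the same pattern works on any OS.
            candidate = "./" + rel.replace(os.sep, "/")
            if regex.fullmatch(candidate):
                yield candidate
```

Calling list(find_regex_naive(r"\./foo/\w+/bar/[0-9]+-\w+\.dat")) would walk the whole tree under the current directory and keep only the matching files.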
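But this is inefficient. Subtrees whose contents cannot match the regex (e.g., ./foo/\w+/baz/, continuing the example above) are traversed needlessly. Ideally, these subtrees should be pruned from the walk: any subdirectory whose pathname is not a partial match for the regex should not be traversed. (I am guessing that GNU find implements such an optimization, but I have not confirmed this through testing or by reading the source.)
Does anyone know of a robust regex-based implementation of find in Python, ideally with the subtree-pruning optimization? I am hoping I have just missed a method in the os.path module or in some third-party module.
From help(os.walk):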
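When topdown is true, the caller can modify the dirnames list in-place
(e.g., via del or slice assignment), and walk will only recurse into the
subdirectories whose names remain in dirnames; this can be used to
prune the search
So once a subdirectory (listed in dirnames) is determined to be unable to lead to a match, it should be removed from dirnames. This gives the subtree pruning you are looking for. (Just be sure to del items from dirnames starting at the tail end, so that you do not change the indices of the remaining items to be deleted.)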
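The pruning idea can be sketched as follows. This is a minimal illustration, not the answer's original script: the name find_regex_pruned is hypothetical, and it assumes the pattern is written relative to the top directory, uses "/" as separator, and contains no regex construct that spans a "/", so that it can be split into one regex per path component. It uses slice assignment rather than del, which sidesteps the index-shifting issue.

```python
import os
import re

def find_regex_pruned(pattern, top="."):
    """Yield '/'-joined relative paths of files under `top` matching `pattern`.

    Assumption: `pattern` can be split on '/' into one regex per path component.
    """
    parts = [re.compile(p) for p in pattern.split("/")]
    last = len(parts) - 1  # index of the component that matches the file name
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        rel = os.path.relpath(dirpath, top)
        comps = [] if rel == os.curdir else rel.split(os.sep)
        depth = len(comps)
        if depth == last:
            # Files at this depth are candidates; nothing deeper can match.
            for name in sorted(filenames):
                if parts[depth].fullmatch(name):
                    yield "/".join(comps + [name])
            dirnames[:] = []
        else:
            # Keep only subdirectories that match the next pattern component;
            # os.walk() will not descend into the ones removed here.
            dirnames[:] = [d for d in dirnames if parts[depth].fullmatch(d)]
```

With the pattern r"foo/\w+/bar/[0-9]+-\w+\.dat", a directory such as tmp is dropped at depth 0 and foo/baz/bad at depth 2, so neither is ever entered.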
Running the script on a directory structure like this:
~/test% tree .
.
|-- foo
|   `-- baz
|       |-- bad
|       |   |-- bad1.txt
|       |   `-- badbad
|       |       `-- bad2.txt
|       `-- bar
|           |-- 1-good.dat
|           `-- 2-good.dat
`-- tmp
    |-- 000.png
    |-- 001.png
    `-- output.gif
yields
pruning tmp
pruning foo/baz/bad
foo/baz/bar/2-good.dat
foo/baz/bar/1-good.dat
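If you uncomment the "checking" print statement, it becomes clear that the pruned directories are not traversed.

I wrote a function select_walk() to search for and select files in a directory tree.
In the following example, the files searched for are those with the extension .dat, .rtf or .jpeg that are located in directories whose names match the following regex pattern:
r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)'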
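Note the presence of the conditional sub-pattern:
(?(1)TURI\1\d*|MONO\d+)
with the group references (1) and \1 pointing to the digit-matching group (\d+) in the sub-pattern b[ae]r(\d+)?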
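To make the conditional concrete, here is a small check of just the last two directory levels of the pattern. The names DIR_PAIR and pair_matches are hypothetical helpers introduced for this illustration; the regex itself is copied from the pattern above.

```python
import re

# Last two directory levels of the pattern: if b[ae]r is followed by digits
# (group 1), the next level must be TURI + those same digits + optional digits;
# otherwise it must be MONO + digits.
DIR_PAIR = re.compile(r'b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)')

def pair_matches(two_levels):
    """Return True if a 'parent\\child' directory pair satisfies the pattern."""
    return DIR_PAIR.fullmatch(two_levels) is not None

for sample in (r'bar25\TURI2501', r'bar25\TURI4813', r'bar25\MONO8',
               r'ber\MONO532', r'ber\TURI30'):
    print(sample, pair_matches(sample))
```

Only bar25\TURI2501 and ber\MONO532 are accepted: with digits after bar, the child must repeat them after TURI; without digits, only the MONO branch applies.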
1 ) The example directory tree is created by a script that first deletes the directories 'foo\', 'fooo\', 'froooo\' and 'faooo\' and then recreates them. It produces the following tree:
J:
|
|--foo
| |--basil
| |--ber89
| |--TURI850
| |--file quetzal.jpeg
| |--file tehoi.txt
| |--TURI1023
| |--ber300
| |--poto%
| |--ocean
| |--file in ocean.rtf
| |--earth
| |--file curcuma in poto%.txt
| |--tamata
| |--vahine
| |--file tahiti.jpeg
| |--file kalaomi.xls
|
|--fooo
| |--york#
| |--noto
| |--nata
| |---file yorkshire.jpeg
| |--plain
| |--zx13ao
| |--ws89rt
| |--bar999
| |--TURI99905
| |--AERIAL
| |--bumbum
| |--corean
| |--minidisc
| |--file galileo.jpeg
| |--file polynesia.dat
| |--file concrete.txt
| |--TURI2227
| |--file Monroe.jpeg
| |--MONO2
| |--file elastic.jpeg
| |--atlantis
| |--atlABC
| |--atlantis_sound
| |--atlantis_image
| |--atlDEFG
|
|--froooo
| |--one_dir
| |--bar25
| |--TURI2501
| |--file matalello.jpeg
| |--file italy.dat
| |--file beretta.xls
| |--file turi2501_ser.rtf
| |--TURI2502
| |--file adamante.jpeg
| |--file egyptic.txt
| |--file urubu.rtf
| |--TURI4813
| |--file boaf_inTURI4813.jpeg
| |--file troui_inTURI4813.txt
| |--MONO8
| |--file in_mono8.dat
| |--file in_mono8.rtf
| |--file in_mono8.xls
| |--ber
| |--TURI30
| |--TURI
| |--MONO532
| |--file bacillus.jpeg
| |--file blueberry.dat
| |--file Perfume.doc
| |--file photo in one_dir.jpeg
| |--file tabula.xls
| |--another_dir
| |--notseen
| |--notseen2
|
|--faooo
| |--somolo-
| |--file ytek.rtf
| |--samala+
| |file kfaz.dat
| |--file 123.txt
| |--file 458.rtf
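A tree like this can be rebuilt from a nested dictionary. The sketch below is not the author's creation script: make_tree and SAMPLE_TREE are hypothetical names, the spec covers only a small part of the tree, and it writes under whatever root directory is passed in instead of deleting and recreating directories on J:.

```python
import os

# Hypothetical, partial spec: a dict is a directory, None is an empty file.
SAMPLE_TREE = {
    "froooo": {
        "one_dir": {
            "bar25": {
                "TURI2501": {"italy.dat": None, "beretta.xls": None},
                "MONO8": {"in_mono8.dat": None},
            },
            "ber": {
                "MONO532": {"blueberry.dat": None, "Perfume.doc": None},
                "TURI30": {},
            },
        },
    },
}

def make_tree(root, spec):
    """Recursively create directories (dict values) and empty files (None values)."""
    for name, child in spec.items():
        path = os.path.join(root, name)
        if child is None:
            with open(path, "w"):
                pass  # create an empty file
        else:
            os.makedirs(path, exist_ok=True)
            make_tree(path, child)
```

Calling make_tree(some_scratch_dir, SAMPLE_TREE) gives a small playground on which the directory-selection logic can be tried without touching real data.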
The regex pattern that matches the files is:
r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
The directories selected to be searched for such files are the following:
'J:\\fooo\\plain\\bar999\\TURI99905'
'J:\\froooo\\one_dir\\bar25\\TURI2501'
'J:\\froooo\\one_dir\\bar25\\TURI2502'
'J:\\froooo\\one_dir\\ber\\MONO532'
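2 )
As a preliminary demonstration, the following code shows what one part of select_walk() does: it builds the regexes needed to explore only the selected directories during the walk through the tree and to return only the selected files: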
import re

def compute_regexes(pat_file, displ=True):
    from os import sep

    # Split the file pattern on the (escaped) path separator, one piece per level.
    splitted_pat = re.split(r'\\\\' if sep == '\\' else '/', pat_file)
    # Pattern of the directory that directly contains the wanted files.
    pat_parent_dir = (r'\\' if sep == '\\' else '/').join(splitted_pat[0:-1])
    if displ:
        print('IN FUNCTION compute_regexes() :'
              '\n\npat_file== %s'
              '\n\nsplitted_pat :\n%s'
              '\n\npat_parent_dir== %s\n'
              % (pat_file, '\n'.join(splitted_pat), pat_parent_dir))

    # dgr maps a group number to the index of the piece in which it appears.
    dgr = {}
    for i, el in enumerate(splitted_pat):
        if re.search(r'\(.*?\)', el):
            dgr[len(dgr) + 1] = i
    if displ:
        print('dgr :')
        print('\n'.join('group(%s) is in splitted_pat[%s]' % (g, i)
                        for g, i in dgr.items()))

    # Each piece gets wrapped in one extra group in pat_dirs, so group
    # references ((1) and \1) must be shifted accordingly.
    def repl(mat, dgr=dgr):
        the = int(mat.group(1) if mat.group(1) else mat.group(2))
        return str(the + dgr[the])

    for i, el in enumerate(splitted_pat):
        splitted_pat[i] = re.sub(r'(?<=\(\?\()(\d+)(?=\))|(?<=\\)(\d+)', repl, el)

    # Nest the directory pieces in optional groups so that any leading part
    # of a path can be matched.
    pat_dirs = ''
    for x in splitted_pat[-2:0:-1]:
        pat_dirs = r'(?=\\|\Z)(\\%s%s)?' % (x, pat_dirs)
    pat_dirs = splitted_pat[0] + pat_dirs
    if displ:
        print('\npat_dirs==', pat_dirs)

    return (re.compile(pat_file), re.compile(pat_dirs), re.compile(pat_parent_dir))


pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
regx_file, regx_dirs, regx_parent_dir = compute_regexes(pat_file)

print('\n\nEXAMPLES with regx_file :\n')
print('pat_file==', pat_file)
for filepath in ('J:\\fooo\\basil\\ber92\\TURI9258\\beru.rtf ',
                 'J:\\froooooo\\ki_ki\\bar\\MONO47\\madrid.jpeg '):
    print(filepath, bool(regx_file.match(filepath)))

print('\n\nEXAMPLES with regx_dirs :\n')
for path in ('J:\\fooo',
             'J:\\fooo\\basil',
             'J:\\fooo\\basil\\ber92',
             'J:\\fooo\\basil\\ber92\\TURI777',
             'J:\\fooo\\basil\\ber92\\TURI9258',
             # No comma after the next literal in the original, so it is
             # concatenated with the one that follows it.
             'J:\\froooooo'
             'J:\\froooooo\\ki_ki',
             'J:\\froooooo\\ki_ki\\bar',
             'J:\\froooooo\\ki=ki\\bar',
             'J:\\froooooo\\ki_ki\\bar\\MONO47'):
    print(path, (" : ~~ this dir's name is OK ~~"
                 if path == regx_dirs.match(path).group()
                 else " : ## this dir's name doesn't match ##"))
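The core trick in pat_dirs, nesting each path component inside an optional group so that any leading part of a path can be validated, can be shown in isolation. The sketch below is a simplified variation rather than the author's code: build_prefix_regex and is_viable are hypothetical names, it uses "/" as separator, and it wraps components in non-capturing groups, which avoids the need to renumber group references.

```python
import re

def build_prefix_regex(components, sep="/"):
    """Compile a regex that fully matches any leading run of `components`."""
    nested = ""
    # Build from the innermost (deepest) component outwards.
    for comp in reversed(components[1:]):
        nested = "(?:" + re.escape(sep) + comp + nested + ")?"
    return re.compile(components[0] + nested)

def is_viable(regex, path):
    """True if `path` is a full match for some prefix of the component list."""
    return regex.fullmatch(path) is not None

prefix_regex = build_prefix_regex(["foo", r"\w+", "bar", r"\d+-\w+\.dat"])
for candidate in ("foo", "foo/baz", "foo/baz/bar", "foo/baz/bad", "tmp"):
    print(candidate, is_viable(prefix_regex, candidate))
```

A directory is worth descending into only while is_viable() holds for its path; foo/baz/bad and tmp fail the test and can be pruned.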
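3 )
Finally, here is the function select_walk(), which does the job of searching a tree for the files whose names match a given regex. It yields the same kind of triples (dirpath, dirnames, filenames) as the built-in os.walk() function, but only those triples in which filenames contains file names that match pat_file.
Of course, during the iteration select_walk() does not explore the directories whose contents can never match the key regex pattern pat_file because of their (directory) names.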
Result of the code in 2 ):
IN FUNCTION compute_regexes() :
pat_file== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)
splitted_pat :
J:
f[ruv]?o+
\w+
b[ae]r(\d+)?
(?(1)TURI\1\d*|MONO\d+)
\w+\.(dat|rtf|jpeg)
pat_parent_dir== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)
dgr :
group(1) is in splitted_pat[3]
group(2) is in splitted_pat[4]
group(3) is in splitted_pat[5]
pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\b[ae]r(\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?
EXAMPLES with regx_file :
pat_file== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)
J:\fooo\basil\ber92\TURI9258\beru.rtf True
J:\froooooo\ki_ki\bar\MONO47\madrid.jpeg True
EXAMPLES with regx_dirs :
J:\fooo : ~~ this dir's name is OK ~~
J:\fooo\basil : ~~ this dir's name is OK ~~
J:\fooo\basil\ber92 : ~~ this dir's name is OK ~~
J:\fooo\basil\ber92\TURI777 : ## this dir's name doesn't match ##
J:\fooo\basil\ber92\TURI9258 : ~~ this dir's name is OK ~~
J:\frooooooJ:\froooooo\ki_ki : ## this dir's name doesn't match ##
J:\froooooo\ki_ki\bar : ~~ this dir's name is OK ~~
J:\froooooo\ki=ki\bar : ## this dir's name doesn't match ##
J:\froooooo\ki_ki\bar\MONO47 : ~~ this dir's name is OK ~~
import os
import re

def select_walk(pat_file, start_dir):
    from os import sep

    # Same regex construction as in compute_regexes().
    splitted_pat = re.split(r'\\\\' if sep == '\\' else '/', pat_file)
    pat_parent_dir = (r'\\' if sep == '\\' else '/').join(splitted_pat[0:-1])

    dgr = {}
    for i, el in enumerate(splitted_pat):
        if re.search(r'\(.*?\)', el):
            dgr[len(dgr) + 1] = i

    def repl(mat, dgr=dgr):
        the = int(mat.group(1) if mat.group(1) else mat.group(2))
        return str(the + dgr[the])

    for i, el in enumerate(splitted_pat):
        splitted_pat[i] = re.sub(r'(?<=\(\?\()(\d+)(?=\))|(?<=\\)(\d+)', repl, el)

    pat_dirs = ''
    for x in splitted_pat[-2:0:-1]:
        pat_dirs = r'(?=\\|\Z)(\\%s%s)?' % (x, pat_dirs)
    pat_dirs = splitted_pat[0] + pat_dirs
    print('pat_dirs==', pat_dirs)

    regx_file = re.compile(pat_file)
    regx_dirs = re.compile(pat_dirs)
    regx_parent_dir = re.compile(pat_parent_dir)

    start_dir = start_dir.rstrip(sep) + sep
    print('\nstart_dir == ' + start_dir)

    for dirpath, dirnames, filenames in os.walk(start_dir):
        dirpath = dirpath.rstrip(sep)
        print('\n'.join(('explored dirpath : %s is_direct_parent: %s'
                         % (dirpath, ('NO', 'YES')[bool(regx_parent_dir.match(dirpath))]),
                         ' dirnames : %s' % dirnames,
                         ' filenames : %s' % filenames)))
        if regx_parent_dir.match(dirpath):
            # This directory directly contains candidate files:
            # keep only the matching ones and do not go any deeper.
            filenames[:] = [filename for filename in filenames
                            if regx_file.match(dirpath + sep + filename)]
            dirnames[:] = []
            print('\n'.join((' dirnames : not to be explored ',
                             ' yielded filenames : %s\n' % filenames)))
            yield (dirpath, dirnames, filenames)
        else:
            # Prune in place: keep only subdirectories whose full path is
            # still a partial match for the directory pattern.
            dirnames[:] = [dirname for dirname in dirnames
                           if regx_dirs.match(dirpath + sep + dirname).group() == dirpath + sep + dirname]
            print('\n'.join(('dirnames to explore : %s ' % dirnames,
                             ' filenames : not to be yielded\n')))


pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
print('\n\nSELECTED (dirpath, dirnames, filenames) :\n'
      + '\n'.join(map(repr, select_walk(pat_file, 'J:\\'))))
Result:
pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\b[ae]r(\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?
start_dir == J:\
explored dirpath : J: is_direct_parent: NO
dirnames : ['Amazon', 'faooo', 'Favorites', 'foo', 'fooo', 'froooo', 'Python', 'RECYCLER', 'System Volume Information']
filenames : ['image00.pfm', 'rep.py']
dirnames to explore : ['foo', 'fooo', 'froooo']
filenames : not to be yielded
explored dirpath : J:\foo is_direct_parent: NO
dirnames : ['basil', 'poto%', 'tamata']
filenames : ['kalaomi.xls']
dirnames to explore : ['basil', 'tamata']
filenames : not to be yielded
explored dirpath : J:\foo\basil is_direct_parent: NO
dirnames : ['ber300', 'ber89']
filenames : []
dirnames to explore : ['ber300', 'ber89']
filenames : not to be yielded
explored dirpath : J:\foo\basil\ber300 is_direct_parent: NO
dirnames : []
filenames : []
dirnames to explore : []
filenames : not to be yielded
explored dirpath : J:\foo\basil\ber89 is_direct_parent: NO
dirnames : ['TURI1023', 'TURI850']
filenames : []
dirnames to explore : []
filenames : not to be yielded
explored dirpath : J:\foo\tamata is_direct_parent: NO
dirnames : ['vahine']
filenames : []
dirnames to explore : []
filenames : not to be yielded
explored dirpath : J:\fooo is_direct_parent: NO
dirnames : ['atlantis', 'plain', 'york#']
filenames : []
dirnames to explore : ['atlantis', 'plain']
filenames : not to be yielded
explored dirpath : J:\fooo\atlantis is_direct_parent: NO
dirnames : ['atlABC', 'atlDEFG']
filenames : []
dirnames to explore : []
filenames : not to be yielded
explored dirpath : J:\fooo\plain is_direct_parent: NO
dirnames : ['bar999', 'ws89rt', 'zx13ao']
filenames : []
dirnames to explore : ['bar999']
filenames : not to be yielded
explored dirpath : J:\fooo\plain\bar999 is_direct_parent: NO
dirnames : ['MONO2', 'TURI2227', 'TURI99905']
filenames : []
dirnames to explore : ['TURI99905']
filenames : not to be yielded
explored dirpath : J:\fooo\plain\bar999\TURI99905 is_direct_parent: YES
dirnames : ['AERIAL', 'minidisc']
filenames : ['concrete.txt', 'galileo.jpeg', 'polynesia.dat']
dirnames : not to be explored
yielded filenames : ['galileo.jpeg', 'polynesia.dat']
explored dirpath : J:\froooo is_direct_parent: NO
dirnames : ['another_dir', 'one_dir']
filenames : []
dirnames to explore : ['another_dir', 'one_dir']
filenames : not to be yielded
explored dirpath : J:\froooo\another_dir is_direct_parent: NO
dirnames : ['notseen', 'notseen2']
filenames : []
dirnames to explore : []
filenames : not to be yielded
explored dirpath : J:\froooo\one_dir is_direct_parent: NO
dirnames : ['bar25', 'ber']
filenames : ['photo in one_dir.jpeg', 'tabula.xls']
dirnames to explore : ['bar25', 'ber']
filenames : not to be yielded
explored dirpath : J:\froooo\one_dir\bar25 is_direct_parent: NO
dirnames : ['MONO8', 'TURI2501', 'TURI2502', 'TURI4813']
filenames : []
dirnames to explore : ['TURI2501', 'TURI2502']
filenames : not to be yielded
explored dirpath : J:\froooo\one_dir\bar25\TURI2501 is_direct_parent: YES
dirnames : []
filenames : ['beretta.xls', 'italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']
dirnames : not to be explored
yielded filenames : ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']
explored dirpath : J:\froooo\one_dir\bar25\TURI2502 is_direct_parent: YES
dirnames : []
filenames : ['adamante.jpeg', 'egyptic.txt', 'urubu.rtf']
dirnames : not to be explored
yielded filenames : ['adamante.jpeg', 'urubu.rtf']
explored dirpath : J:\froooo\one_dir\ber is_direct_parent: NO
dirnames : ['MONO532', 'TURI', 'TURI30']
filenames : []
dirnames to explore : ['MONO532']
filenames : not to be yielded
explored dirpath : J:\froooo\one_dir\ber\MONO532 is_direct_parent: YES
dirnames : []
filenames : ['bacillus.jpeg', 'blueberry.dat', 'Perfume.doc']
dirnames : not to be explored
yielded filenames : ['bacillus.jpeg', 'blueberry.dat']
SELECTED (dirpath, dirnames, filenames) :
('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat'])
('J:\\froooo\\one_dir\\bar25\\TURI2501', [], ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf'])
('J:\\froooo\\one_dir\\bar25\\TURI2502', [], ['adamante.jpeg', 'urubu.rtf'])
('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat'])
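Since the selected triples have the same shape as those of os.walk(), turning them into full file paths takes only a small helper. The sketch below is an illustration, not part of the answer: iter_paths is a hypothetical name, and ntpath is passed explicitly only so that the Windows-style paths from the output above are joined the same way on any platform.

```python
import ntpath
import os

def iter_paths(triples, pathmod=os.path):
    """Flatten (dirpath, dirnames, filenames) triples into full file paths."""
    for dirpath, _dirnames, filenames in triples:
        for name in filenames:
            yield pathmod.join(dirpath, name)

# Two of the triples shown in the output above.
selected = [
    ('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat']),
    ('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat']),
]
for full_path in iter_paths(selected, pathmod=ntpath):
    print(full_path)
```

On Windows the default pathmod=os.path is enough, and the generator returned by select_walk() can be passed in directly in place of the list.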