"find -regex …" in Python, or how to find files whose full name (path + name) matches a regular expression?


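I want to find files whose full name (relative, though absolute is fine too) matches a given regular expression (that is, like the glob module, but with regex matching instead of shell wildcard matching). With find, one can do the following, for example:

find -regex ./foo/\w+/bar/[0-9]+-\w+.dat

Of course, I could use find via os.system(...) or os.exec*(...), but I am looking for a pure Python solution. Code that combines os.walk(...) with re module regular expressions is a simple Python solution. (It is not robust and misses many (less common) corner cases, but it is good enough for my single-use purpose of locating specific data files for a one-time database insertion.)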
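A minimal sketch of that os.walk plus re approach, not the asker's original listing. The function name find_regex is an assumption, and paths are normalised to forward slashes so the same pattern can be used on any platform:

```python
import os
import re


def find_regex(top, pattern):
    """Yield every file path under top whose whole path matches pattern."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            # Normalise separators so the pattern can always use '/'
            path = os.path.join(dirpath, name).replace(os.sep, '/')
            # Like find -regex, require the pattern to match the whole path
            if regex.fullmatch(path):
                yield path


if __name__ == '__main__':
    for path in find_regex('.', r'\./foo/\w+/bar/[0-9]+-\w+\.dat'):
        print(path)
```

Every directory below top is visited here, whether or not it could ever contain a match.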

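But this is inefficient. Subtrees whose contents cannot possibly match the regex (for example /foo/\w+/baz/, continuing the example above) are traversed unnecessarily. Ideally these subtrees should be pruned from the walk: any subdirectory whose pathname is not a partial match for the regex should not be traversed. (I guess that GNU find implements such an optimization, but I have not confirmed this through testing or by reading the source code.)

Does anyone know of a robust regex-based Python implementation of find, ideally with the subtree-pruning optimization? I am hoping I have simply missed a method in the os.path module or in some third-party module.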

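From help(os.walk):

When topdown is true, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search ...

So once a subdirectory (listed in dirnames) is determined to be unacceptable, it should be removed from dirnames. That gives you the subtree pruning you are looking for. (Just be sure to del items from dirnames starting at the tail end, so that the indices of the remaining items to delete do not change.)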
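A minimal sketch of this pruning, not the answer's own script. It assumes the pattern has been split by hand into one regex per path component, which only works when no component pattern spans a separator. The name find_regex_pruned is hypothetical:

```python
import os
import re


def find_regex_pruned(top, component_patterns):
    """Yield files under top whose path components match the patterns in order."""
    compiled = [re.compile(p) for p in component_patterns]
    last = len(compiled) - 1  # index of the pattern for the file name itself
    for dirpath, dirnames, filenames in os.walk(top, topdown=True):
        rel = os.path.relpath(dirpath, top)
        # depth = number of path components between top and dirpath
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        # Delete from the tail end so the indices still to visit stay valid
        for i in range(len(dirnames) - 1, -1, -1):
            if depth >= last or not compiled[depth].fullmatch(dirnames[i]):
                del dirnames[i]  # os.walk will not descend into this one
        if depth == last:
            for name in filenames:
                if compiled[last].fullmatch(name):
                    yield os.path.join(dirpath, name)


if __name__ == '__main__':
    parts = ['foo', r'\w+', 'bar', r'[0-9]+-\w+\.dat']
    for path in find_regex_pruned('.', parts):
        print(path)
```

A slice assignment such as dirnames[:] = [kept names] has the same effect and avoids the index bookkeeping.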

Running the script on a directory structure like this:

~/test% tree .
.
|-- foo
|   `-- baz
|       |-- bad
|       |   |-- bad1.txt
|       |   `-- badbad
|       |       `-- bad2.txt
|       `-- bar
|           |-- 1-good.dat
|           `-- 2-good.dat
`-- tmp
    |-- 000.png
    |-- 001.png
    `-- output.gif
yields

pruning tmp
pruning foo/baz/bad
foo/baz/bar/2-good.dat
foo/baz/bar/1-good.dat

If you uncomment the "checking" print statement, it becomes clear that the pruned directories are not walked.

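I wrote a function select_walk() that searches a directory tree and selects files in it.

In the following example, the files searched for are those with the extension .dat, .rtf or .jpeg that sit in directories whose names match the following regex pattern:

r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)'

Note the presence of the conditional sub-pattern:

(?(1)TURI\1\d*|MONO\d+)

Its group references (1) and \1 point to the digit-matching group (\d+) in the sub-pattern b[ae]r(\d+)? of the directory pattern.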
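To make the conditional concrete, here is a small sketch that applies just the last two directory levels of the pattern to a few names taken from the example tree. The sample list and the use of fullmatch are choices made for this illustration:

```python
import re

# Last two directory levels of the pattern: b[ae]r(\d+)? then the conditional
pat = re.compile(r'b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)')

samples = (
    r'bar25\TURI2501',  # group 1 is '25', so TURI must be followed by 25
    r'bar25\TURI4813',  # TURI is not followed by 25
    r'bar25\MONO8',     # group 1 matched, so MONO is not allowed
    r'ber\MONO532',     # no digits after 'ber', so the MONO branch applies
    r'ber\TURI30',      # no digits after 'ber', so TURI is not allowed
)
for name in samples:
    print(name, bool(pat.fullmatch(name)))
```

When group 1 took part in the match, the "yes" branch TURI\1\d* is tried; otherwise the "no" branch MONO\d+ is tried.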

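1) Here is the code that creates a directory tree to serve as an example:

(Note that it first deletes the directories "foo\", "fooo\", "froooo\", "faooo\" before re-creating them.)

This code creates the following tree: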

J:
|
|--foo
|   |--basil
|      |--ber89
|         |--TURI850
|            |--file quetzal.jpeg
|            |--file tehoi.txt
|         |--TURI1023
|      |--ber300
|   |--poto%
|      |--ocean
|         |--file in ocean.rtf
|      |--earth
|      |--file curcuma in poto%.txt
|   |--tamata
|      |--vahine
|         |--file tahiti.jpeg
|   |--file kalaomi.xls
|
|--fooo
|  |--york#
|     |--noto
|     |--nata
|     |--file yorkshire.jpeg
|  |--plain
|     |--zx13ao
|     |--ws89rt
|     |--bar999
|        |--TURI99905
|           |--AERIAL
|              |--bumbum
|              |--corean
|           |--minidisc
|           |--file galileo.jpeg
|           |--file polynesia.dat
|           |--file concrete.txt
|        |--TURI2227
|           |--file Monroe.jpeg
|        |--MONO2
|           |--file elastic.jpeg
|  |--atlantis
|     |--atlABC
|        |--atlantis_sound
|        |--atlantis_image
|     |--atlDEFG
|
|--froooo
|  |--one_dir
|     |--bar25
|        |--TURI2501
|           |--file matalello.jpeg
|           |--file italy.dat
|           |--file beretta.xls
|           |--file turi2501_ser.rtf
|        |--TURI2502
|           |--file adamante.jpeg
|           |--file egyptic.txt
|           |--file urubu.rtf
|        |--TURI4813
|           |--file boaf_inTURI4813.jpeg
|           |--file troui_inTURI4813.txt
|        |--MONO8
|           |--file in_mono8.dat
|           |--file in_mono8.rtf
|           |--file in_mono8.xls
|     |--ber
|        |--TURI30
|        |--TURI
|        |--MONO532
|           |--file bacillus.jpeg
|           |--file blueberry.dat
|           |--file Perfume.doc
|     |--file photo in one_dir.jpeg
|     |--file tabula.xls
|  |--another_dir
|     |--notseen
|     |--notseen2
|
|--faooo
|  |--somolo-
|     |--file ytek.rtf
|  |--samala+
|     |--file kfaz.dat
|  |--file 123.txt
|  |--file 458.rtf
The regex pattern that the files must match is:

r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
The directories that are selectively searched for such files are:

'J:\\fooo\\plain\\bar999\\TURI99905'
'J:\\froooo\\one_dir\\bar25\\TURI2501'
'J:\\froooo\\one_dir\\bar25\\TURI2502'
'J:\\froooo\\one_dir\\ber\\MONO532'

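2) As a preliminary demonstration, the following code shows what is done by the part of select_walk() that builds the regexes needed to explore only the selected directories during the iterative walk of the tree and to return the selected files: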

import re


def compute_regexes(pat_file, displ=True):
    from os import sep

    # Split the file pattern into one sub-pattern per path component.
    # On Windows the separator appears in the pattern as an escaped backslash.
    splitted_pat = re.split(r'\\\\' if sep == '\\' else '/', pat_file)

    # Pattern of the directory that directly contains the wanted files
    pat_parent_dir = (r'\\' if sep == '\\' else '/').join(splitted_pat[0:-1])

    if displ:
        print('IN FUNCTION compute_regexes() :'
              '\n\npat_file== %s'
              '\n\nsplitted_pat :\n%s'
              '\n\npat_parent_dir== %s\n'
              % (pat_file, '\n'.join(splitted_pat), pat_parent_dir))

    # dgr maps a group number to the index of the component that holds it
    # (heuristic: one parenthesised group is counted per component)
    dgr = {}
    for i, el in enumerate(splitted_pat):
        if re.search(r'\(.*?\)', el):
            dgr[len(dgr) + 1] = i
    if displ:
        print('dgr :')
        print('\n'.join('group(%s) is in splitted_pat[%s]' % (g, i)
                        for g, i in dgr.items()))

    # Renumber group references, because pat_dirs wraps every directory
    # level in one extra capturing group
    def repl(mat, dgr=dgr):
        the = int(mat.group(1) if mat.group(1) else mat.group(2))
        return str(the + dgr[the])

    for i, el in enumerate(splitted_pat):
        splitted_pat[i] = re.sub(r'(?<=\(\?\()(\d+)(?=\))|(?<=\\)(\d+)', repl, el)

    # Nest the directory levels as optional groups, deepest level innermost,
    # so that every valid leading part of a path is matched in full
    pat_dirs = ''
    for x in splitted_pat[-2:0:-1]:
        pat_dirs = r'(?=\\|\Z)(\\%s%s)?' % (x, pat_dirs)
    pat_dirs = splitted_pat[0] + pat_dirs
    if displ:
        print('\npat_dirs==', pat_dirs)

    return (re.compile(pat_file), re.compile(pat_dirs), re.compile(pat_parent_dir))


# The examples use Windows paths and expect os.sep to be a backslash
pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
regx_file, regx_dirs, regx_parent_dir = compute_regexes(pat_file)

print('\n\nEXAMPLES with regx_file :\n')
print('pat_file==', pat_file)
for filepath in ('J:\\fooo\\basil\\ber92\\TURI9258\\beru.rtf  ',
                 'J:\\froooooo\\ki_ki\\bar\\MONO47\\madrid.jpeg  '):
    print(filepath, bool(regx_file.match(filepath)))

print('\n\nEXAMPLES with regx_dirs :\n')
for path in ('J:\\fooo',
             'J:\\fooo\\basil',
             'J:\\fooo\\basil\\ber92',
             'J:\\fooo\\basil\\ber92\\TURI777',
             'J:\\fooo\\basil\\ber92\\TURI9258',
             'J:\\froooooo'  # NB: no comma, so this literal is concatenated with the next one
             'J:\\froooooo\\ki_ki',
             'J:\\froooooo\\ki_ki\\bar',
             'J:\\froooooo\\ki=ki\\bar',
             'J:\\froooooo\\ki_ki\\bar\\MONO47'):
    print(path, ("   : ~~ this dir's name is OK ~~" if path == regx_dirs.match(path).group()
                 else "   : ## this dir's name doesn't match ##"))

Running this demonstration prints:

IN FUNCTION compute_regexes() :

pat_file== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)

splitted_pat :
J:
f[ruv]?o+
\w+
b[ae]r(\d+)?
(?(1)TURI\1\d*|MONO\d+)
\w+\.(dat|rtf|jpeg)

pat_parent_dir== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)

dgr :
group(1) is in splitted_pat[3]
group(2) is in splitted_pat[4]
group(3) is in splitted_pat[5]

pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\b[ae]r(\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?


EXAMPLES with regx_file :

pat_file== J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)
J:\fooo\basil\ber92\TURI9258\beru.rtf   True
J:\froooooo\ki_ki\bar\MONO47\madrid.jpeg   True


EXAMPLES with regx_dirs :

J:\fooo    : ~~ this dir's name is OK ~~
J:\fooo\basil    : ~~ this dir's name is OK ~~
J:\fooo\basil\ber92    : ~~ this dir's name is OK ~~
J:\fooo\basil\ber92\TURI777    : ## this dir's name doesn't match ##
J:\fooo\basil\ber92\TURI9258    : ~~ this dir's name is OK ~~
J:\frooooooJ:\froooooo\ki_ki    : ## this dir's name doesn't match ##
J:\froooooo\ki_ki\bar    : ~~ this dir's name is OK ~~
J:\froooooo\ki=ki\bar    : ## this dir's name doesn't match ##
J:\froooooo\ki_ki\bar\MONO47    : ~~ this dir's name is OK ~~

3) Finally, here is the function select_walk(), which does the job of searching a tree for files whose names match a given regular expression. It yields the same kind of triples (dirpath, dirnames, filenames) as the built-in os.walk() function, but only those whose filenames list holds the file names that match pat_file.

Of course, during the iteration select_walk() does not explore the directories whose contents could never match the key regex pattern pat_file because of their (directory) names.
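The directory test relies on pat_dirs nesting one optional group per path level, so that any acceptable leading part of a path is matched in full while a path that strays from the pattern is only matched up to the point where it strays. A minimal sketch of that idea, assuming forward-slash separators and hypothetical helper names. It uses non-capturing groups, so unlike pat_dirs it needs no renumbering of group references:

```python
import re


def build_dirs_regex(components):
    """Return a regex matching every leading part of the component path."""
    pattern = ''
    # Wrap from the deepest level outwards; each level is optional
    for comp in reversed(components[1:]):
        pattern = '(?:/(?:%s)%s)?' % (comp, pattern)
    return re.compile(components[0] + pattern)


# Directory levels of the pattern ./foo/\w+/bar (file-name level left out)
dirs_re = build_dirs_regex([r'\.', 'foo', r'\w+', 'bar'])


def worth_exploring(path):
    """True if path could still lead to a directory accepted by the pattern."""
    return dirs_re.fullmatch(path) is not None


for path in ('.', './foo', './foo/baz', './foo/baz/bar',
             './foo/baz/bad', './tmp'):
    print(path, worth_exploring(path))
```

Comparing the matched text with the whole path, as the answer does with match(...).group(), is equivalent to the fullmatch call used here.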
import os
import re


def select_walk(pat_file, start_dir):
    from os import sep

    # One sub-pattern per path component (same construction as in compute_regexes)
    splitted_pat = re.split(r'\\\\' if sep == '\\' else '/', pat_file)

    pat_parent_dir = (r'\\' if sep == '\\' else '/').join(splitted_pat[0:-1])

    dgr = {}
    for i, el in enumerate(splitted_pat):
        if re.search(r'\(.*?\)', el):
            dgr[len(dgr) + 1] = i

    def repl(mat, dgr=dgr):
        the = int(mat.group(1) if mat.group(1) else mat.group(2))
        return str(the + dgr[the])

    for i, el in enumerate(splitted_pat):
        splitted_pat[i] = re.sub(r'(?<=\(\?\()(\d+)(?=\))|(?<=\\)(\d+)', repl, el)

    pat_dirs = ''
    for x in splitted_pat[-2:0:-1]:
        pat_dirs = r'(?=\\|\Z)(\\%s%s)?' % (x, pat_dirs)
    pat_dirs = splitted_pat[0] + pat_dirs
    print('pat_dirs==', pat_dirs)

    regx_file = re.compile(pat_file)
    regx_dirs = re.compile(pat_dirs)
    regx_parent_dir = re.compile(pat_parent_dir)

    start_dir = start_dir.rstrip(sep) + sep
    print('\nstart_dir == ' + start_dir)

    for dirpath, dirnames, filenames in os.walk(start_dir):

        dirpath = dirpath.rstrip(sep)
        print('\n'.join(('explored dirpath : %s    is_direct_parent: %s'
                         % (dirpath, ('NO', 'YES')[bool(regx_parent_dir.match(dirpath))]),
                         '           dirnames  : %s' % dirnames,
                         '          filenames  : %s' % filenames)))

        if regx_parent_dir.match(dirpath):
            # Direct parent of wanted files: keep the matching files, stop descending
            filenames[:] = [filename for filename in filenames
                            if regx_file.match(dirpath + sep + filename)]
            dirnames[:] = []
            print('\n'.join(('           dirnames  : not to be explored ',
                             '  yielded filenames  : %s\n' % filenames)))
            yield (dirpath, dirnames, filenames)

        else:
            # Prune in place: keep only the subdirectories whose whole path matches regx_dirs
            dirnames[:] = [dirname for dirname in dirnames
                           if regx_dirs.match(dirpath + sep + dirname).group() == dirpath + sep + dirname]
            print('\n'.join(('dirnames to explore  : %s ' % dirnames,
                             '          filenames  : not to be yielded\n')))


pat_file = r'J:\\f[ruv]?o+\\\w+\\b[ae]r(\d+)?\\(?(1)TURI\1\d*|MONO\d+)\\\w+\.(dat|rtf|jpeg)'
print('\n\nSELECTED (dirpath, dirnames, filenames) :\n'
      + '\n'.join(map(repr, select_walk(pat_file, 'J:\\'))))
pat_dirs== J:(?=\\|\Z)(\\f[ruv]?o+(?=\\|\Z)(\\\w+(?=\\|\Z)(\\b[ae]r(\d+)?(?=\\|\Z)(\\(?(4)TURI\4\d*|MONO\d+))?)?)?)?

start_dir == J:\
explored dirpath : J:    is_direct_parent: NO
           dirnames  : ['Amazon', 'faooo', 'Favorites', 'foo', 'fooo', 'froooo', 'Python', 'RECYCLER', 'System Volume Information']
          filenames  : ['image00.pfm', 'rep.py']
dirnames to explore  : ['foo', 'fooo', 'froooo'] 
          filenames  : not to be yielded

explored dirpath : J:\foo    is_direct_parent: NO
           dirnames  : ['basil', 'poto%', 'tamata']
          filenames  : ['kalaomi.xls']
dirnames to explore  : ['basil', 'tamata'] 
          filenames  : not to be yielded

explored dirpath : J:\foo\basil    is_direct_parent: NO
           dirnames  : ['ber300', 'ber89']
          filenames  : []
dirnames to explore  : ['ber300', 'ber89'] 
          filenames  : not to be yielded

explored dirpath : J:\foo\basil\ber300    is_direct_parent: NO
           dirnames  : []
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\foo\basil\ber89    is_direct_parent: NO
           dirnames  : ['TURI1023', 'TURI850']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\foo\tamata    is_direct_parent: NO
           dirnames  : ['vahine']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\fooo    is_direct_parent: NO
           dirnames  : ['atlantis', 'plain', 'york#']
          filenames  : []
dirnames to explore  : ['atlantis', 'plain'] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\atlantis    is_direct_parent: NO
           dirnames  : ['atlABC', 'atlDEFG']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\plain    is_direct_parent: NO
           dirnames  : ['bar999', 'ws89rt', 'zx13ao']
          filenames  : []
dirnames to explore  : ['bar999'] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\plain\bar999    is_direct_parent: NO
           dirnames  : ['MONO2', 'TURI2227', 'TURI99905']
          filenames  : []
dirnames to explore  : ['TURI99905'] 
          filenames  : not to be yielded

explored dirpath : J:\fooo\plain\bar999\TURI99905    is_direct_parent: YES
           dirnames  : ['AERIAL', 'minidisc']
          filenames  : ['concrete.txt', 'galileo.jpeg', 'polynesia.dat']
           dirnames  : not to be explored 
  yielded filenames  : ['galileo.jpeg', 'polynesia.dat']

explored dirpath : J:\froooo    is_direct_parent: NO
           dirnames  : ['another_dir', 'one_dir']
          filenames  : []
dirnames to explore  : ['another_dir', 'one_dir'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\another_dir    is_direct_parent: NO
           dirnames  : ['notseen', 'notseen2']
          filenames  : []
dirnames to explore  : [] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir    is_direct_parent: NO
           dirnames  : ['bar25', 'ber']
          filenames  : ['photo in one_dir.jpeg', 'tabula.xls']
dirnames to explore  : ['bar25', 'ber'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir\bar25    is_direct_parent: NO
           dirnames  : ['MONO8', 'TURI2501', 'TURI2502', 'TURI4813']
          filenames  : []
dirnames to explore  : ['TURI2501', 'TURI2502'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir\bar25\TURI2501    is_direct_parent: YES
           dirnames  : []
          filenames  : ['beretta.xls', 'italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']
           dirnames  : not to be explored 
  yielded filenames  : ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf']

explored dirpath : J:\froooo\one_dir\bar25\TURI2502    is_direct_parent: YES
           dirnames  : []
          filenames  : ['adamante.jpeg', 'egyptic.txt', 'urubu.rtf']
           dirnames  : not to be explored 
  yielded filenames  : ['adamante.jpeg', 'urubu.rtf']

explored dirpath : J:\froooo\one_dir\ber    is_direct_parent: NO
           dirnames  : ['MONO532', 'TURI', 'TURI30']
          filenames  : []
dirnames to explore  : ['MONO532'] 
          filenames  : not to be yielded

explored dirpath : J:\froooo\one_dir\ber\MONO532    is_direct_parent: YES
           dirnames  : []
          filenames  : ['bacillus.jpeg', 'blueberry.dat', 'Perfume.doc']
           dirnames  : not to be explored 
  yielded filenames  : ['bacillus.jpeg', 'blueberry.dat']



SELECTED (dirpath, dirnames, filenames) :
('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat'])
('J:\\froooo\\one_dir\\bar25\\TURI2501', [], ['italy.dat', 'matallelo.jpeg', 'turi2501_ser.rtf'])
('J:\\froooo\\one_dir\\bar25\\TURI2502', [], ['adamante.jpeg', 'urubu.rtf'])
('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat'])
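
Since the result is a sequence of (dirpath, dirnames, filenames) triples, a caller who wants one full path per file still has to join each directory with its file names. A small sketch of that step, using two of the triples printed above as literal input; the helper name iter_selected_paths is an assumption:

```python
import ntpath


def iter_selected_paths(triples):
    """Yield one full path per file from (dirpath, dirnames, filenames) triples."""
    for dirpath, _dirnames, filenames in triples:
        for name in filenames:
            # ntpath joins with backslashes on any platform, as in the J:\ example
            yield ntpath.join(dirpath, name)


selected = [
    ('J:\\fooo\\plain\\bar999\\TURI99905', [], ['galileo.jpeg', 'polynesia.dat']),
    ('J:\\froooo\\one_dir\\ber\\MONO532', [], ['bacillus.jpeg', 'blueberry.dat']),
]
for path in iter_selected_paths(selected):
    print(path)
```

On Windows itself, os.path.join would do the same job as ntpath.join.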