Python: using a regular expression to replace bracketed items in a text file

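I have an open text file, f. I need to find every instance of square brackets that contain text, including the brackets themselves. For example, with --

1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last

it would match/print:

1 - [First]
3 - [Finally]
3 - [B]

Once I have printed these matches, I want to remove them and normalize any excess whitespace, so the final text would be:

1 - This is the line
2 - (And) another line
3 - the last

Conceptually the function would look like this (pseudocode), although I am having trouble with the regex part:

def find_and_replace(file):
    f=open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
        normalize white space

Thanks.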

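"[.+]" creates a character class that matches a single "." or "+". You need to escape the "[" and "]" characters, because they have a special meaning in regular expressions. Even with the brackets escaped, the pattern would match a string like "[a]foo[b]" as one item, because quantifiers are greedy by default. Add a "?" after the "+" to tell it to match the shortest possible sequence of characters.

So try "\\[.+?\\]" and see whether that works.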
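The three behaviours described above can be checked side by side. This is only an illustrative sketch; the sample strings are made up for the demonstration and are not part of the question.

```python
import re

sample = "[a]foo[b]"

# "[.+]" is a character class: it matches one '.' or one '+', never a bracket.
char_class = re.findall("[.+]", "a.b+c [x]")

# Escaped brackets with a greedy quantifier run on to the last ']'.
greedy = re.findall(r"\[.+\]", sample)

# Adding '?' makes the quantifier lazy, so each bracketed item matches separately.
lazy = re.findall(r"\[.+?\]", sample)

print(char_class)  # ['.', '+']
print(greedy)      # ['[a]foo[b]']
print(lazy)        # ['[a]', '[b]']
```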

If you also want to find and remove empty brackets "[]", replace the "+" quantifier with "*".
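A small sketch of why the choice between "+" and "*" matters when empty brackets can occur. The sample string is an assumption made up for illustration. With "+", the lazy quantifier still has to consume at least one character, so a match that starts at "[]" runs on to the next "]".

```python
import re

text = "keep [] this [x] text"

# '+' needs at least one character inside the brackets, so the match that
# starts at "[]" extends to the closing bracket of "[x]".
with_plus = re.sub(r"\[.+?\]", "", text)

# '*' also accepts zero characters, so "[]" and "[x]" are removed separately.
with_star = re.sub(r"\[.*?\]", "", text)

print(repr(with_plus))  # 'keep  text'
print(repr(with_star))  # 'keep  this  text'
```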

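Regex:

re.findall(r'\[[^\]]+\]', 'foo [bar] baz')

yields:

['[bar]']

The negated class [^\]]+ matches one or more characters that are not "]", so a match can never run past the first closing bracket. Compiling this pattern in place of "[.+]" should therefore work for you.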
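As a rough illustration, the same negated-class pattern can be used both to list the bracketed items and to remove them. The sample line is borrowed from the example text used elsewhere in this thread, and the whitespace clean-up step is an added assumption, not part of this answer.

```python
import re

# One '[', then one or more characters that are not ']', then ']'.
BRACKETED = re.compile(r"\[[^\]]+\]")

line = "3 - [Finally][B] the last"

found = BRACKETED.findall(line)
stripped = BRACKETED.sub("", line)
# Collapse the leftover run of spaces (assumed normalisation step).
normalised = " ".join(stripped.split())

print(found)       # ['[Finally]', '[B]']
print(normalised)  # 3 - the last
```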

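You have to escape the "[" and "]" characters and use a non-greedy quantifier:

r'\[.+?\]'

Note: with this regex you cannot handle nested brackets such as [foo[bar]].

Also, to remove the extra whitespace, add \s? to the end of the regex.

Example:

>>> import re
>>> a = '''1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
'''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last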
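Putting the pieces together, the function from the question might be sketched as follows. This is one possible shape under stated assumptions, not the only way to do it: it takes any iterable of lines (an open file object qualifies), prints each match with its line number, and replaces each match with a single space before collapsing whitespace so that neighbouring words are not fused. The name find_and_replace comes from the question; everything else is illustrative.

```python
import re

BRACKETED = re.compile(r"\[.+?\]")

def find_and_replace(lines):
    """Print each bracketed item with its line number; return the cleaned lines."""
    cleaned = []
    for number, line in enumerate(lines, start=1):
        for item in BRACKETED.findall(line):
            print(f"{number} - {item}")
        # Replace with a space, then collapse all runs of whitespace.
        without_brackets = BRACKETED.sub(" ", line)
        cleaned.append(" ".join(without_brackets.split()))
    return cleaned

sample = [
    "This is the [first] line\n",
    "(And) another line\n",
    "[Finally][B] the last\n",
]
result = find_and_replace(sample)
print(result)
```

Because an open file object yields its lines when iterated, passing the file object f from the question in place of sample should work the same way. Note that replacing with a space splits a word that has a bracketed block in the middle of it; substitute an empty string instead if that matters for your data.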

Using JBernardo's regex, the following displays the line and its number each time a bracketed block is removed:

import re

ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''

print(ss, '\n')

# Map each 1-based line number to the repr of the original line, for reporting.
dico_lines = dict((n, repr(line)) for n, line in enumerate(ss.splitlines(True), 1))

def repl(mat, countline=[1]):
    # countline is a deliberately mutable default: it keeps the current line
    # number between calls, so this function is only good for a single pass.
    if mat.group(1):
        print("line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0], countline[0] + 1))
        countline[0] += 1
        return mat.group(1)
    else:
        print("line %s: removing %10s  in  %s" % (countline[0], repr(mat.group()), dico_lines[countline[0]]))
        return ''

print('\n' + re.sub(r'(\n)|\[.*?\] ?', repl, ss))

which results in

When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:— 

line 1: removing  '[xxxx] '  in  'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing    '[yyy]'  in  "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ  ] '  in  'Behind the gateways[ZZZZ  ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing    '[AAA]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing '[UUUUU] '  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing   '[BBBB]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'

When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—
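As an alternative to counting newlines inside the replacement callback, the line number can be derived from the match offset. The sketch below is a different technique from the one in the answer, shown only for comparison; remove_with_report and the shortened sample text are hypothetical names and data.

```python
import re

BRACKETED = re.compile(r"\[.*?\] ?")

def remove_with_report(text):
    """Return (cleaned_text, report); report holds (line_number, removed_text)."""
    report = []

    def repl(match):
        # Line number = 1 + number of newlines that precede the match.
        line_number = text.count("\n", 0, match.start()) + 1
        report.append((line_number, match.group()))
        return ""

    return BRACKETED.sub(repl, text), report

poem = "When colour goes [xxxx] home\nAnd lights that shine\nWith danc[yyy]ing girls"
cleaned, report = remove_with_report(poem)
print(cleaned)
print(report)  # [(1, '[xxxx] '), (3, '[yyy]')]
```

This keeps no state between calls, at the cost of rescanning the text for newlines on every match.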
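But as JBernardo pointed out, this regex runs into trouble if the string contains nested brackets:

ss = 'one [two [three] ] end of line'
print(re.sub(r'\[.+?\]\s?','',ss))

produces

one ] end of line

If the pattern is modified so that the content may contain neither "[" nor "]", only the innermost bracketed block is removed:

ss = 'one [two [three] ] end of line'
print(re.sub(r'\[[^\][]*\]\s?','',ss))

gives

one [two ] end of line
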
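So I looked for solutions to several sub-cases, in case you also want to handle bracketed blocks at every level of nesting.
Since a regex is not a parser, a bracketed block that itself contains nested bracketed blocks cannot be removed in a single pass. We have to iterate, removing the innermost blocks one nesting level at a time.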
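The iteration described above can be reduced to a few lines. This is a minimal sketch that ignores the whitespace handling of the sub-cases below; remove_nested is a hypothetical helper name.

```python
import re

# A bracketed block whose content contains no other bracket, i.e. an innermost block.
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def remove_nested(text):
    """Remove innermost blocks repeatedly until a pass removes nothing."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text

raw = remove_nested("one [two [three] ] end of line")
tidy = " ".join(raw.split())
print(repr(raw))   # 'one  end of line'
print(repr(tidy))  # 'one end of line'
```

Unbalanced brackets such as "][" never form a complete block, so they are left in place and the loop still terminates.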

Sub-case 1: simply removing the nested bracketed blocks:

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, regx=re.compile(r'( |(?<! ))+((?<!])\[[^[\]]*\])( *)')):
    # Each pass removes innermost blocks and keeps at most one preceding
    # blank (group 1); loop until no block is left.
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(r'\1', x)
    return x


print('\n==========================\n' + clean(ss))
Notice that a leading blank remains on the two lines that were initially indented:

   [Inter][A] initially shifted
    [Away [is this] [][4] ] shifted content

are transformed into

 initially shifted
 shifted content
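Sub-case 2: I therefore improved the regex and the algorithm so that the leading blanks of such lines are removed as well:

def clean(x, regx=re.compile(r'(?=^( ))?( |(?<! ))+((?<!])\[[^[\]]*\])( )*', re.MULTILINE)):
    def repl(mat):
        # Group 1 is set only when the match starts with a blank at the
        # beginning of a line; in that case drop the blanks entirely.
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(repl, x)
    return x


print('\n==========================\n' + clean(ss))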
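Lines that start with blanks but contain no proper bracketed block are left unchanged. If you also want to remove the leading blanks of those lines, it is better to call strip() on every line; in that case you do not need this solution and the previous one is enough.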
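The strip() suggestion might look like the sketch below, which also collapses runs of blanks inside each line. tidy_lines and the sample text are illustrative assumptions, not part of the original answer.

```python
import re

def tidy_lines(text):
    """Strip every line and collapse runs of spaces or tabs inside it."""
    return "\n".join(
        re.sub(r"[ \t]+", " ", line.strip()) for line in text.splitlines()
    )

messy = "   initially shifted\n    fgjezhr][fgh   \nThis is the       line   "
tidied = tidy_lines(messy)
print(tidied)
```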

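Sub-case 3: to also display the lines on which a removal is performed, the code has to be modified to take the iteration into account:

  • on each round of the iteration the lines change progressively, so we cannot use a constant dico_lines

  • in addition, on each round of the iteration the line counter must be reset to 1

To obtain these two adaptations I use a trick: modifying the default values of the replacement function's parameters.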
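The trick can be seen in isolation in the sketch below. In Python 3 the defaults of a function are stored in its `__defaults__` attribute (Python 2 also exposed it as `func_defaults`), and rebinding that tuple changes what later calls receive. The function and values here are made up purely to show the mechanism.

```python
def tag(text, counter=None):
    # counter is expected to be a one-element list supplied through __defaults__.
    counter[0] += 1
    return f"{counter[0]}:{text}"

# Install a fresh counter as the default for the last parameter.
tag.__defaults__ = ([0],)
first = tag("a")
second = tag("b")

# Rebinding the defaults resets the state for the next round.
tag.__defaults__ = ([0],)
third = tag("c")

print(first, second, third)  # 1:a 2:b 1:c
```

The state lives on the function object, so no extra argument has to be passed through re.sub; the price is that the behaviour is easy to overlook when reading the code.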

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, rag=re.compile(r'\[.*\]', re.MULTILINE),
          regx=re.compile(r'(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^[\]\n]*\])( *)', re.MULTILINE)):

    def repl(mat, cnt=None, dico_lignes=None):
        if mat.group(1):
            # A newline was matched: move the line counter forward.
            print("line %s: detecting %s  ==> count incremented to %s" % (cnt[0], str(mat.groups('')), cnt[0] + 1))
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            print("line %s: removing %s   IN   %s" % (cnt[0], repr(mat.group(4)), dico_lignes[cnt[0]]))
            return '' if mat.group(2) else mat.group(3)

    while rag.search(x):
        print('\n--------------------------\n' + x)
        # Rebind the defaults before each pass: a counter restarted at 1 and
        # a line table rebuilt from the current state of the text.
        repl.__defaults__ = ([1], dict((n, repr(line)) for n, line in enumerate(x.splitlines(True), 1)))
        x = regx.sub(repl, x)
    return x


print('\n==========================\n' + clean(ss))
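If the goal is only to report which block was removed on which line, processing the text one line at a time avoids having to reset a counter on each pass. The sketch below is an alternative under that assumption, not the approach of the answer above; clean_with_report is a hypothetical name and the whitespace normalisation is simplified to collapsing all runs of whitespace.

```python
import re

# Innermost block: brackets whose content contains no bracket and no newline.
INNERMOST = re.compile(r"\[[^\[\]\n]*\]")

def clean_with_report(text):
    """Remove nested bracketed blocks line by line, recording each removal."""
    report = []
    cleaned = []
    for number, line in enumerate(text.splitlines(), start=1):
        while True:
            match = INNERMOST.search(line)
            if match is None:
                break
            report.append((number, match.group()))
            line = line[:match.start()] + line[match.end():]
        cleaned.append(" ".join(line.split()))
    return "\n".join(cleaned), report

text = "one [two [three] ] end\nplain line\n[a][b] tail"
result, removed = clean_with_report(text)
print(result)
print(removed)
```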
