Python: using a regular expression to replace bracketed items in a text file

python regex


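I have an open text file, f. I need to find each instance of square brackets containing text, including the brackets. For example, with --

1 - This is the [First] line
2 - (And) another line
3 - [Finally][B] the last

it would match/print:

1 - [First]
3 - [Finally]
3 - [B]

Once I have printed those matches, I'd like to delete them and normalize any extra whitespace, so the final text would be:

1 - This is the line
2 - (And) another line
3 - the last

The function would conceptually look like this, though I am having trouble with the regex part of it: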

def find_and_replace(file):
    f=open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
        normalize white space

Thanks.

On the regex front, "[.+]" creates a character class that matches either . or +. You need to escape the [ and ] characters, because they have a special meaning in regular expressions. In addition, the escaped pattern \[.+\] would match a string such as [a]foo[b] as a whole, because quantifiers are greedy by default. Add a ? after the + to tell it to match the shortest possible sequence of characters.

So try "\\[.+?\\]" and see if that works.

If you also want to find and remove [], replace the + quantifier with *.
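The three points above (character class, escaping, greedy versus non-greedy) can be checked with a short sketch. The sample strings are made up for illustration and are not part of the question.

```python
import re

s = "x [a]foo[b] y . +"

# Unescaped: "[.+]" is a character class matching a single "." or "+".
char_class = re.findall(r"[.+]", s)

# Escaped but greedy: one match running from the first "[" to the last "]".
greedy = re.findall(r"\[.+\]", s)

# Escaped and non-greedy: each bracketed item is matched separately.
lazy = re.findall(r"\[.+?\]", s)

# "*" instead of "+" also matches the empty brackets "[]".
with_empty = re.findall(r"\[.*?\]", "[] [a]")

print(char_class)   # ['.', '+']
print(greedy)       # ['[a]foo[b]']
print(lazy)         # ['[a]', '[b]']
print(with_empty)   # ['[]', '[a]']
```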

Regex:

re.findall(r'\[[^\]]+\]', 'foo [bar] baz')

yields:

['[bar]']

So the pattern \[[^\]]+\] should work for you.
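As a rough sketch of how this pattern could be fitted into the function the question describes: the version below takes any iterable of lines (an open file would do), and the names BRACKETED, found and cleaned, as well as the sample lines, are assumptions made for illustration. Whitespace is normalized here with split/join, which is only one possible reading of "normalize".

```python
import re

# "[", one or more characters that are not "]", then "]"
BRACKETED = re.compile(r"\[[^\]]+\]")

def find_and_replace(lines):
    """Collect (line number, match) pairs and return them with the cleaned lines."""
    found = []
    cleaned = []
    for number, line in enumerate(lines, start=1):
        for match in BRACKETED.findall(line):
            found.append((number, match))
        without = BRACKETED.sub("", line)
        # Collapse the runs of whitespace left behind by the removal.
        cleaned.append(" ".join(without.split()))
    return found, cleaned

sample = [
    "1 - This is the [First] line",
    "2 - (And) another line",
    "3 - [Finally][B] the last",
]
found, cleaned = find_and_replace(sample)
for number, match in found:
    print(number, "-", match)
print("\n".join(cleaned))
```

With a real file, something like `with open(path) as f: find_and_replace(f)` should behave the same way, since split() also discards the trailing newline of each line.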

You have to escape the [] characters and use the non-greedy operator:

r'\[.+?\]'

Note: with regular expressions you will not be able to handle nested brackets such as [foo[bar]].

Also, to remove the extra whitespace, add \s? to the end of the regex.

For example:

>>> import re
>>> a = '''1 - This is the [first] line
... 2 - (And) another line
... 3 - [Finally][B] the last
... '''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last
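One thing worth checking with the trailing \s? is that \s also matches a newline, so a bracketed item at the very end of a line takes the line break with it. The sketch below illustrates this on a made-up string and shows one possible variation, [ \t]?, which is an assumption of this example and not part of the answer.

```python
import re

text = "keep this [tag]\nnext line [x] end\n"

# \s also matches "\n": the bracket at the end of line 1 swallows the line break.
merged = re.sub(r"\[.+?\]\s?", "", text)

# Restricting the optional character to a space or a tab keeps the line break.
kept = re.sub(r"\[.+?\][ \t]?", "", text)

print(repr(merged))  # 'keep this next line end\n'
print(repr(kept))    # 'keep this \nnext line end\n'
```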

Using JBernardo's regex, to display the line and its number each time a bracketed chunk is removed, do the following:

import re

ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''

print(ss, '\n')

# Map each line number (starting at 1) to the repr of the original line
dico_lines = dict((n, repr(line)) for n, line in enumerate(ss.splitlines(True), 1))

# The mutable default countline is deliberately shared between calls:
# it acts as the line counter while re.sub() walks through the text
def repl(mat, countline=[1]):
    if mat.group(1):
        # A newline was matched: keep it and move on to the next line
        print("line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0], countline[0] + 1))
        countline[0] += 1
        return mat.group(1)
    else:
        # A bracketed chunk was matched: report it and remove it
        print("line %s: removing %10s  in  %s" % (countline[0], repr(mat.group()), dico_lines[countline[0]]))
        return ''

print('\n' + re.sub(r'(\n)|\[.*?\] ?', repl, ss))

which results in

When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ  ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:— 

line 1: removing  '[xxxx] '  in  'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing    '[yyy]'  in  "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ  ] '  in  'Behind the gateways[ZZZZ  ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing    '[AAA]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing '[UUUUU] '  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing   '[BBBB]'  in  'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'

When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—

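But as JBernardo pointed out, this regex runs into trouble if the string contains nested brackets:

import re

ss = 'one [two [three] ] end of line'
print(re.sub(r'\[.+?\]\s?', '', ss))

produces

one ] end of line

If the pattern is changed so that a chunk may not contain brackets, only the most deeply nested bracketed chunk is removed:

ss = 'one [two [three] ] end of line'
print(re.sub(r'\[[^\][]*\]\s?', '', ss))

gives

one [two ] end of line

So I looked for solutions to several sub-cases, in case you also want to process all the nested bracketed chunks.
Since a regex is not a parser, a bracketed chunk that itself contains nested bracketed chunks cannot be removed in a single pass: we have to iterate, removing the chunks progressively, one nesting level at a time.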
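For comparison, nesting can also be handled without a regex by counting bracket depth in one pass. This is a different technique from the iteration the answer goes on to use. The function name strip_brackets is hypothetical, and the sketch assumes brackets are balanced: a stray "]" is kept as-is, while an unclosed "[" would drop everything after it.

```python
def strip_brackets(text):
    """Remove every [...] block, nested ones included, using a depth counter."""
    out = []
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1            # entering a (possibly nested) block
        elif ch == "]" and depth > 0:
            depth -= 1            # leaving a block
        elif depth == 0:
            out.append(ch)        # keep only characters outside all blocks
    return "".join(out)

result = strip_brackets("one [two [three] ] end of line")
print(" ".join(result.split()))   # one end of line
```

The final split/join only collapses the double space left where the block used to be.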

Sub-case 1: simply removing the nested bracketed chunks:

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

# Each pass removes the chunks that contain no brackets themselves,
# keeping a single blank (group 1) in place of the surrounding blanks
def clean(x, regx=re.compile(r'( |(?<! ))+((?<!])\[[^[\]]*\])( *)')):
    while regx.search(x):
        # Show the current text and the matches found in this pass
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(r'\1', x)
    return x


print('\n==========================\n' + clean(ss))

Notice that the two lines that were initially indented still begin with a blank:

   [Inter][A] initially shifted
    [Away [is this] [][4] ] shifted content

are transformed into

 initially shifted
 shifted content

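Sub-case 2: so I improved the regex and the algorithm to remove all the leading blanks of such lines as well:

# Group 1 is set only when the match starts at the beginning of a line with a blank;
# in that case nothing is kept, otherwise the single blank of group 2 is kept
def clean(x, regx=re.compile(r'(?=^( ))?( |(?<! ))+((?<!])\[[^[\]]*\])( )*', re.MULTILINE)):
    def repl(mat):
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(repl, x)
    return x


print('\n==========================\n' + clean(ss))

Lines that begin with blanks but contain no proper bracketed chunk are left unchanged. If you also want to remove the leading blanks of those lines, it is better to call strip() on every line; in that case you don't need this solution, and the previous one is enough.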
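The strip-every-line variant mentioned above might look like the sketch below. It uses a simpler pattern than the answer's (INNERMOST and remove_blocks_and_strip are names invented for this example), and it goes one step further than strip() by also collapsing runs of whitespace inside each line, which may or may not be wanted.

```python
import re

# A block whose content contains no bracket, i.e. an innermost block
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def remove_blocks_and_strip(text):
    # Remove innermost blocks repeatedly until no complete block is left.
    while INNERMOST.search(text):
        text = INNERMOST.sub("", text)
    # Collapse inner whitespace and strip both ends of every line.
    return "\n".join(" ".join(line.split()) for line in text.splitlines())

sample = "   [Inter][A] initially shifted\n    [Away [is this] [][4] ] shifted content"
print(remove_blocks_and_strip(sample))
```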

Sub-case 3: to also display the lines in which a removal is performed, the code has to be modified to take the iteration into account:

At each round of the iteration the lines change progressively, so a constant dico_lines can no longer be used.

Moreover, at each round of the iteration the line counter must be reset to 1.

To obtain these two adaptations, I use a trick: modifying the default values of the replacement function.

import re

ss = '''This is the [first]       line   
(And) another line
   [Inter][A] initially shifted
[Finally][B] the last
    Additional ending lines (this one without brackets):    
[Note that [ by the way [ref [ 1]] there are]    [some] other ]cases
tuvulu[]gusti perena[3]              bdiiii
    [Away [is this] [][4] ] shifted content
    fgjezhr][fgh
'''

def clean(x, rag=re.compile(r'\[.*\]', re.MULTILINE),
          regx=re.compile(r'(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^[\]\n]*\])( *)', re.MULTILINE)):
    def repl(mat, cnt=None, dico_lignes=None):
        if mat.group(1):
            # A newline was matched: keep it and increment the line counter
            print("line %s: detecting %s ==> count incremented to %s" % (cnt[0], str(mat.groups('')), cnt[0] + 1))
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            # A bracketed chunk was matched: report it and remove it
            print("line %s: removing %s IN %s" % (cnt[0], repr(mat.group(4)), dico_lignes[cnt[0]]))
            return '' if mat.group(2) else mat.group(3)
    while rag.search(x):
        print('\n--------------------------\n' + x)
        # Reset the counter to 1 and rebuild the line dictionary for this round
        repl.__defaults__ = ([1], dict((n, repr(line)) for n, line in enumerate(x.splitlines(True), 1)))
        x = regx.sub(repl, x)
    return x

print('\n==========================\n' + clean(ss))
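Since the pattern above never lets a chunk span a newline, one possible alternative to resetting the counter through the function defaults is to process the text line by line, so that enumerate supplies the line number. The sketch below is such a variation, not the answer's method: clean_with_report and INNERMOST are hypothetical names, and whitespace handling is left out to keep it short.

```python
import re

# An innermost block that stays on one line
INNERMOST = re.compile(r"\[[^\[\]\n]*\]")

def clean_with_report(text):
    """Remove bracketed blocks line by line, recording (line number, block) pairs."""
    report = []
    cleaned = []
    for number, line in enumerate(text.splitlines(), start=1):
        while True:
            match = INNERMOST.search(line)
            if match is None:
                break
            report.append((number, match.group()))
            # Cut the matched block out of the line and search again.
            line = line[:match.start()] + line[match.end():]
        cleaned.append(line)
    return "\n".join(cleaned), report

cleaned, report = clean_with_report("a [x [y]] b\nno brackets\n[z]c")
for number, block in report:
    print("line %d: removing %r" % (number, block))
print(cleaned)
```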
