Python: using a regular expression to replace bracketed items in a text file
I have an open text file, f. I need to find every instance of square brackets that contain text, including the brackets themselves. For example, it would match/print:
1 - [First]
3 - [Finally]
3 - [B]
Once I have printed these matches, I want to delete them and normalize any excess whitespace, so the final text would be:
1 - This is the line
2 - (And) another line
3 - the last
Conceptually the function would look like this, although I am having trouble with the regex part:
def find_and_replace(file):
    f=open(file)
    regex = re.compile("[.+]")
    find regex.all
    for item in regex.all:
        print item, line-number
        replace(item, '')
    normalize white space
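Thanks.
On the regex front, "[.+]" creates a character class that matches either . or +. You need to escape the [ and ] characters because they have a special meaning in a regex. In addition, the escaped pattern would match a string like [a]foo[b] as a whole, because quantifiers are greedy by default. Add a ? after the + to tell it to match the shortest possible sequence of characters.
So try "\\[.+?\\]" and see if that works.
If you also want to find and remove [], replace the + quantifier with *.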
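The difference between these patterns can be seen in a short sketch. The sample strings and variable names are assumptions chosen for illustration; they are not taken from the question's file.

```python
import re

sample = "[a]foo[b]"  # hypothetical sample string

# "[.+]" is a character class: it matches one "." or one "+", nothing else.
char_class = re.findall(r"[.+]", "1.5+2")
# Escaped brackets with a greedy "+" run up to the last "]".
greedy = re.findall(r"\[.+\]", sample)
# "+?" is lazy, so each block is matched on its own.
lazy = re.findall(r"\[.+?\]", sample)
# "*?" additionally accepts an empty pair "[]".
with_empty = re.findall(r"\[.*?\]", "x[]y[b]")

print(char_class)   # ['.', '+']
print(greedy)       # ['[a]foo[b]']
print(lazy)         # ['[a]', '[b]']
print(with_empty)   # ['[]', '[b]']
```

Using a raw string (r"...") avoids having to double the backslashes as in "\\[.+?\\]".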
Regex:
re.findall(r'\[[^\]]+\]', 'foo [bar] baz')
yields:
['[bar]']
So this pattern should work for you.
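To get the "line number - match" listing the question asks for, the same negated-class pattern can be applied line by line. This is only a sketch: report_matches and sample_lines are hypothetical names, and the sample lines are assumed rather than read from the asker's file.

```python
import re

# "[", then one or more characters that are not "]", then "]".
BRACKETED = re.compile(r"\[[^\]]+\]")

def report_matches(lines):
    """Return (line_number, block) pairs for every bracketed block."""
    found = []
    for number, line in enumerate(lines, start=1):
        for block in BRACKETED.findall(line):
            found.append((number, block))
    return found

# Assumed sample input; a real file could be passed as open(path) instead.
sample_lines = [
    "This is the [first] line",
    "(And) another line",
    "[Finally][B] the last",
]

for number, block in report_matches(sample_lines):
    print(f"{number} - {block}")
```

Because the negated class can never run past a closing bracket, no lazy quantifier is needed here.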
You have to escape the [] characters and use a non-greedy operator:
r'\[.+?\]'
Note: with a regex you won't be able to handle nested brackets such as [foo[bar]].
Also, to remove the extra whitespace, add \s? at the end of the regex.
For example:
>>> import re
>>> a = '''1 - This is the [first] line
2 - (And) another line
3 - [Finally][B] the last
'''
>>> a = re.sub(r'\[.+?\]\s?','',a)
>>> print(a)
1 - This is the line
2 - (And) another line
3 - the last
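One thing to keep in mind is that \s also matches a newline, so a block at the very end of a line takes the line break with it. The sketch below shows that effect and one possible alternative that tidies spaces separately; strip_brackets and the sample text are assumptions made for illustration.

```python
import re

def strip_brackets(text):
    """Remove [..] blocks, then tidy spaces without touching newlines."""
    without_blocks = re.sub(r"\[.+?\]", "", text)
    # Collapse runs of spaces/tabs left behind by the removed blocks.
    collapsed = re.sub(r"[ \t]{2,}", " ", without_blocks)
    # Trim blanks at both ends of each line (this also drops indentation).
    return "\n".join(line.strip() for line in collapsed.split("\n"))

sample = "keep this [tag]\nnext [a] [b] line\n"  # hypothetical input

joined = re.sub(r"\[.+?\]\s?", "", sample)
tidy = strip_brackets(sample)

print(repr(joined))  # 'keep this next line\n'
print(repr(tidy))    # 'keep this\nnext line\n'
```

If indentation matters, the per-line strip() could be replaced with rstrip().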
Using JBernardo's regex, to display the line and its number each time a bracketed chunk of string is removed:
import re

ss = '''When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'''
print(ss, '\n')
# Map each 1-based line number to the repr of the original line.
dico_lines = dict((n, repr(line)) for n, line in enumerate(ss.splitlines(True), 1))

def repl(mat, countline=[1]):
    # countline is a mutable default on purpose: it keeps the current line number between calls.
    if mat.group(1):
        print("line %s: detecting \\n , the counter of lines is incremented -> %s" % (countline[0], countline[0] + 1))
        countline[0] += 1
        return mat.group(1)
    else:
        print("line %s: removing %10s in %s" % (countline[0], repr(mat.group()), dico_lines[countline[0]]))
        return ''

print('\n' + re.sub(r'(\n)|\[.*?\] ?', repl, ss))
which results in
When colour goes [xxxx] home into the eyes,
And lights that shine are shut again,
With danc[yyy]ing girls and sweet birds' cries
Behind the gateways[ZZZZ ] of the brain;
And that no-place which gave them birth, shall close
The [AAA]rainbow [UUUUU] and [BBBB]the rose:—
line 1: removing  '[xxxx] ' in 'When colour goes [xxxx] home into the eyes,\n'
line 1: detecting \n , the counter of lines is incremented -> 2
line 2: detecting \n , the counter of lines is incremented -> 3
line 3: removing    '[yyy]' in "With danc[yyy]ing girls and sweet birds' cries\n"
line 3: detecting \n , the counter of lines is incremented -> 4
line 4: removing '[ZZZZ ] ' in 'Behind the gateways[ZZZZ ] of the brain;\n'
line 4: detecting \n , the counter of lines is incremented -> 5
line 5: detecting \n , the counter of lines is incremented -> 6
line 6: removing    '[AAA]' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing '[UUUUU] ' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
line 6: removing   '[BBBB]' in 'The [AAA]rainbow [UUUUU] and [BBBB]the rose:—'
When colour goes home into the eyes,
And lights that shine are shut again,
With dancing girls and sweet birds' cries
Behind the gatewaysof the brain;
And that no-place which gave them birth, shall close
The rainbow and the rose:—
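Because the counter above lives in a mutable default argument, it keeps its value between separate calls to re.sub. A possible alternative, sketched here under my own naming (remove_with_report is hypothetical), derives the line number from the match position instead, so no newline group and no shared counter are needed.

```python
import re

def remove_with_report(text):
    """Remove [..] blocks; return (cleaned_text, report)."""
    report = []

    def repl(match):
        # Line number = 1 + number of newlines before the start of the match.
        line_no = text.count("\n", 0, match.start()) + 1
        report.append((line_no, match.group()))
        return ""

    cleaned = re.sub(r"\[.*?\] ?", repl, text)
    return cleaned, report

cleaned, report = remove_with_report("a [x] b\nplain\nc[y]d [z] e")
for line_no, block in report:
    print("line %s: removing %r" % (line_no, block))
print(cleaned)
```

Counting newlines on every match is quadratic in the worst case, which should not matter for small files.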
But as JBernardo pointed out, this regex runs into trouble if the string contains nested brackets:
ss = 'one [two [three] ] end of line'
print(re.sub(r'\[.+?\]\s?', '', ss))
produces
one ] end of line
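If the regex pattern is modified, only the most deeply nested bracketed chunk is removed:
ss = 'one [two [three] ] end of line'
print(re.sub(r'\[[^\][]*\]\s?', '', ss))
gives
one [two ] end of line
So I looked for solutions to several subcases, in case you also want to handle bracketed chunks nested at any depth. Since a regex is not a parser, a bracketed chunk that itself contains nested bracketed chunks cannot be removed without iterating, progressively removing the chunks at each level of nesting.
Subcase 1
Simply removing the nested bracketed chunks: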
import re
# The two "shifted" lines begin with a leading space.
ss = '''This is the [first] line
(And) another line
 [Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
 [Away [is this] [][4] ] shifted content
fgjezhr][fgh
'''
def clean(x, regx=re.compile(r'( |(?<! ))+((?<!])\[[^\[\]]*\])( *)')):
    # Each pass removes innermost blocks; loop until no block is left.
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub('\\1', x)
    return x
print('\n==========================\n' + clean(ss))
You can see that a blank remains at the start of the two lines that were initially shifted:
 [Inter][A] initially shifted
 [Away [is this] [][4] ] shifted content
are transformed into
 initially shifted
 shifted content
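Leaving the whitespace handling aside, the iteration idea behind subcase 1 can be reduced to a small loop built on re.subn, which reports how many replacements each pass made. This is a simplified sketch with hypothetical names (INNERMOST, remove_nested), not the pattern used above, and it does not normalize the blanks left behind.

```python
import re

# A block that contains no other bracket, i.e. an innermost block.
INNERMOST = re.compile(r"\[[^\[\]]*\]")

def remove_nested(text):
    """Strip bracketed blocks of any nesting depth, innermost first."""
    passes = 0
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:          # nothing left to remove
            return text, passes
        passes += 1

result, passes = remove_nested("one [two [three] ] end of line")
print(repr(result), passes)   # 'one  end of line' 2
```

Unbalanced brackets such as "fgjezhr][fgh" never form a complete block, so they are left untouched.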
Subcase 2:
So I improved the regex and the algorithm to also clean the leading blank at the start of such lines:
def clean(x, regx=re.compile(r'(?=^( ))?( |(?<! ))+((?<!])\[[^\[\]]*\])( )*', re.MULTILINE)):
    def repl(mat):
        # Group 1 is captured when the match begins with a space at the start of a line.
        return '' if mat.group(1) else mat.group(2)
    while regx.search(x):
        print('------------\n', x, '\n', '\n'.join(map(str, regx.findall(x))))
        x = regx.sub(repl, x)
    return x
print('\n==========================\n' + clean(ss))
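Lines that start with blanks but contain no proper bracketed chunk are left unchanged. If you also want to remove the leading blanks in those lines, you are better off calling strip() on every line, in which case you don't need this solution and the previous one is enough.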
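Subcase 3:
To add a display of the lines in which a removal is performed, the code must now be modified to take into account that we iterate:
- at each turn of the iteration the lines progressively change, so we can't use a constant dico_lines
- moreover, at each turn of the iteration the line counter must be reset to 1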
import re
# The two "shifted" lines begin with a leading space.
ss = '''This is the [first] line
(And) another line
 [Inter][A] initially shifted
[Finally][B] the last
Additional ending lines (this one without brackets):
[Note that [ by the way [ref [ 1]] there are] [some] other ]cases
tuvulu[]gusti perena[3] bdiiii
 [Away [is this] [][4] ] shifted content
fgjezhr][fgh
'''
def clean(x, rag=re.compile(r'\[.*\]', re.MULTILINE),
          regx=re.compile(r'(\n)|(?=^( ))?( |(?<! ))+((?<!])\[[^\[\]\n]*\])( *)', re.MULTILINE)):
    def repl(mat, cnt=None, dico_lignes=None):
        if mat.group(1):
            print("line %s: detecting %s ==> count incremented to %s" % (cnt[0], str(mat.groups('')), cnt[0] + 1))
            cnt[0] += 1
            return mat.group(1)
        if mat.group(4):
            print("line %s: removing %s IN %s" % (cnt[0], repr(mat.group(4)), dico_lignes[cnt[0]]))
            return '' if mat.group(2) else mat.group(3)
    while rag.search(x):
        print('\n--------------------------\n' + x)
        # Reset the counter to 1 and rebuild the line table for this pass
        # (Python 2's func_defaults is __defaults__ in Python 3).
        repl.__defaults__ = ([1], dict((n, repr(line)) for n, line in enumerate(x.splitlines(True), 1)))
        x = regx.sub(repl, x)
    return x
print('\n==========================\n' + clean(ss))
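Since the difficulty above comes from a regex not being a parser, a non-regex alternative is to pair the brackets with a stack in a single scan. The sketch below is my own simplification (strip_balanced is a hypothetical name): like subcase 3 it does not let a block span several lines, but it reports each outermost block once instead of level by level, and it leaves whitespace untouched.

```python
def strip_balanced(text):
    """Remove balanced [...] blocks (any depth) in a single scan.

    Returns (cleaned_text, removed), where removed holds
    (line_number, block) for every outermost block.
    """
    stack, pairs = [], []
    for pos, char in enumerate(text):
        if char == "[":
            stack.append(pos)
        elif char == "]" and stack:
            pairs.append((stack.pop(), pos))
        elif char == "\n":
            stack.clear()      # blocks may not span several lines
    pairs.sort()

    pieces, removed, cursor = [], [], 0
    for start, end in pairs:
        if start < cursor:
            continue           # nested inside a block already dropped
        pieces.append(text[cursor:start])
        removed.append((text.count("\n", 0, start) + 1, text[start:end + 1]))
        cursor = end + 1
    pieces.append(text[cursor:])
    return "".join(pieces), removed

sample = "one [two [three] ] end\nfgjezhr][fgh\nx[]y [3] z"  # assumed input
cleaned, removed = strip_balanced(sample)
for line_no, block in removed:
    print("line %s: removing %r" % (line_no, block))
print(cleaned)
```

A stray "]" or an unclosed "[" never forms a pair, so a line like "fgjezhr][fgh" passes through unchanged.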