How can I capture all tags inside a specific tag with a regular expression in Python?


For example, suppose I have this text:

<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>
sometextsometextsometextsometext
What I want is to turn it into:

<tag1 blablablah>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext</tag1>
sometextsometextsometextsometext
I search with this regex (it works in Notepad++ as well as with Python's re.compile function):

(<tag1[^>]*>.*?)(<[^>]*>)(.*?</tag1>)
and this replacement (it also works with re.sub):

\1<XXX>\2</XXX>\3
But it only finds and changes the first occurrence, not all of them:

<tag1 blablablah>sometext<XXX><i></XXX>sometext</i>sometext<i>sometext</i>sometext</tag1>
sometextsometextsometextsometext

Can anyone help me?
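A likely reason only the first inner tag changes: re.sub replaces non-overlapping matches, and a pattern that spans the whole tag1 element can match only once per element. The sketch below illustrates this. The exact pattern is an assumption (three groups covering everything from the opening tag1 tag to its closing tag), not necessarily the one used in the question.

```python
import re

s = ('<tag1 blablablah>sometext<i>sometext</i>sometext'
     '<i>sometext</i>sometext</tag1>')

# Assumed shape of the pattern: three groups that together span the
# whole <tag1>...</tag1> element.
pattern = re.compile(r'(<tag1[^>]*>.*?)(<[^>]*>)(.*?</tag1>)')

# The single match consumes the entire element, so there is nothing
# left for a second, non-overlapping match.
matches = pattern.findall(s)
print(len(matches))  # prints 1

# Only the first inner tag ends up wrapped.
result = pattern.sub(r'\1<XXX>\2</XXX>\3', s)
print(result)
```

Under that assumption, the fix is to match each inner tag separately rather than the whole element at once.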

Try this:

<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>
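One way this pattern might be used is sketched below. This is only an illustration: the two-step approach and the variable names are assumptions, not part of the answer. Group 1 captures the tag name, group 2 captures everything between the outer tags, and a second re.sub then wraps the tags found in group 2.

```python
import re

s = ('<tag1 blablablah>sometext<i>sometext</i>sometext'
     '<i>sometext</i>sometext</tag1>')

outer = re.compile(r'<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>')

m = outer.search(s)
tag_name = m.group(1)   # name of the outer tag
inner = m.group(2)      # everything between the outer tags

# Wrap every tag inside the captured content; the outer pair is untouched.
wrapped_inner = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', inner)

# Splice the rewritten content back between the original outer tags.
result = s[:m.start(2)] + wrapped_inner + s[m.end(2):]
print(result)
```

Note that `[^<>]+?` needs at least one character after the tag name, so a bare `<tag1>` with no attributes would not match as written, and the lowercase character classes mean uppercase tag names need re.IGNORECASE.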
Explanation:

"
<              # Match the character “<” literally
(              # Match the regular expression below and capture its match into backreference number 1
   (?:            # Match the regular expression below
      [a-z]          # Match a single character in the range between “a” and “z”
         +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      :              # Match the character “:” literally
   )?             # Between zero and one times, as many times as possible, giving back as needed (greedy)
   [a-z]          # Match a single character in the range between “a” and “z”
   \w             # Match a single character that is a “word character” (letters, digits, and underscores)
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b             # Assert position at a word boundary
[^<>]          # Match a single character NOT present in the list “<>”
   +?             # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
>              # Match the character “>” literally
(              # Match the regular expression below and capture its match into backreference number 2
   .              # Match any single character that is not a line break character
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
</             # Match the characters “</” literally
\1             # Match the same text as most recently matched by capturing group number 1
>              # Match the character “>” literally
"

The problem is keeping the first and last tags out of the replacement. If you split them off, it becomes simple:

import re

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
# Split off the outermost opening and closing tags
start, end = s.find('>') + 1, s.rfind('<')
s_list = [s[:start], s[start:end], s[end:]]
# Wrap every remaining tag in the middle part
s_list[1] = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', s_list[1])
print(''.join(s_list))
It's not a one-liner, though.

Alternatively, you can do this:

print(re.sub(r'([^(^<)])(<[^>]*>(?!$))', r'\1<XXX>\2</XXX>', s))

Note that this only works when the outermost tags sit at the very start and end of the string.
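If the outer element can have text before or after it, a replacement callback is one possible way around that restriction. The following is a minimal sketch, not the method from the answers: the helper name wrap_inner_tags and its parameters are made up for illustration, and it assumes tag1 elements are not nested inside one another.

```python
import re

def wrap_inner_tags(text, outer='tag1', wrapper='XXX'):
    """Wrap every tag nested inside <outer ...>...</outer> in <wrapper>."""
    # Hypothetical helper: matches each outer element as three pieces.
    element = re.compile(
        r'(<{0}\b[^>]*>)(.*?)(</{0}>)'.format(re.escape(outer)),
        re.DOTALL,
    )
    replacement = r'<{0}>\1</{0}>'.format(wrapper)

    def fix(match):
        opening, inner, closing = match.groups()
        # Only the content between the outer tags is rewritten.
        return opening + re.sub(r'(<[^>]*>)', replacement, inner) + closing

    return element.sub(fix, text)

sample = 'intro <tag1 blablablah>a<i>b</i>c</tag1>\ntrailing <b>text</b>'
print(wrap_inner_tags(sample))
```

Tags outside the tag1 element, such as the `<b>` pair in the sample, are left alone because the callback only ever sees the content of each matched element.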

Try changing your pattern like this:

(<tag1[^>]*>).*?(<[^>]+>).*?(</tag1>)

This XML is not well-formed, so parsing it is out of the question.