How do I extract a string with a regular expression in Python?


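I want to write a simple Markdown parser function that takes a single line of Markdown and translates it into the appropriate HTML. To keep it simple, I only want to support one Markdown feature in atx syntax: headers.

A header is designated by 1-6 hash characters, followed by a space, followed by text. The number of hashes determines the header level of the HTML output. Examples: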

# Header will become <h1>Header</h1>

## Header will become <h2>Header</h2>

###### Header will become <h6>Header</h6>


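The rules are as follows:

- Header content should only come after the initial hash(es) plus a space character.
- Invalid headers should be returned as the Markdown that was received, with no translation.
- Whitespace before and after both the header content and the hash(es) should be ignored in the resulting output.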
Here is the code I wrote:

    import re

    def markdown_parser(markdown):
        results = ''
        pattern = re.compile("#+\s")
        matches = pattern.search(markdown.strip())
        if (matches != None):
            tag = matches[0]
            hashTagLen = len(tag) - 1
            htmlTag = "h" + str(hashTagLen)
            content = markdown.strip()[(hashTagLen + 1):]
            results = "<" + htmlTag + ">" + content + "</" + htmlTag + ">"
        else:
            results = markdown
        return results

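When I run this code, I get the following exception:

    Unhandled Exception: '_sre.SRE_Match' object is not subscriptable

I don't know why this error occurs. When I run the script in a shell it works fine, but when I run it in a unittest environment (import unittest), the error appears.

Please help me.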
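You can't use an index to access a match object. Subscripting a match object (matches[0]) is only supported from Python 3.6 onward, so the test environment is probably running an older interpreter than your shell.

You can use re.sub to replace one to six # characters, followed by a space and a word (the pattern for the word is (\w+)), with the HTML you want.

re.sub can be used with a function that handles the replacement: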

    import re

    def replacer(m):
        return '<h{level}>{header}</h{level}>'.format(level=len(m.group(1)), header=m.group(2))

    def markdown_parser(markdown):
        results = [re.sub(r'(#{1,6}) (\w+)', replacer, line) for line in markdown.split('\n')]
        return "\n".join(results).strip()

    sourceText = "##header#content## smaller header#contents### something"
    print(markdown_parser(sourceText))


This prints:

    ##header#content<h2>smaller</h2> header#contents<h3>something</h3>
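This code looks very verbose; much of the logic can be done in the regex itself.

If you look at the original Markdown library, written in Perl, you can see that only one pattern is needed, and the first capture group then tells you which header level it is: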

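    sub _DoHeaders {
        my $text = shift;

        # Setext-style headers:
        #     Header 1
        #     ========
        #
        #     Header 2
        #     --------
        #
        $text =~ s{ ^(.+)[ \t]*\n=+[ \t]*\n+ }{
            "<h1>"  .  _RunSpanGamut($1)  .  "</h1>\n\n";
        }egmx;

        $text =~ s{ ^(.+)[ \t]*\n-+[ \t]*\n+ }{
            "<h2>"  .  _RunSpanGamut($1)  .  "</h2>\n\n";
        }egmx;

        # atx-style headers:
        #   # Header 1
        #   ## Header 2
        #   ## Header 2 with closing hashes ##
        #   ...
        #   ###### Header 6
        #
        $text =~ s{
                ^(\#{1,6})  # $1 = string of #'s
                [ \t]*
                (.+?)       # $2 = Header text
                [ \t]*
                \#*         # optional closing #'s (not counted)
                \n+
            }{
                my $h_level = length($1);
                "<h$h_level>"  .  _RunSpanGamut($2)  .  "</h$h_level>\n\n";
            }egmx;

        return $text;
    }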
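For the single-line case in the question, the same single-pattern idea could be sketched in Python roughly as follows. This is not a port of the Perl routine: the name ATX_HEADER and the exact pattern are assumptions made for illustration. Unlike the Perl version it requires whitespace after the hashes (as the question's rules do) and it does not strip optional closing hashes.

```python
import re

# Assumed pattern (not taken from any library): optional leading whitespace,
# 1-6 hashes, at least one whitespace character, then the header text with
# surrounding whitespace trimmed.
ATX_HEADER = re.compile(r'^\s*(#{1,6})\s+(.+?)\s*$')


def markdown_parser(markdown):
    """Translate one line of atx-style Markdown into an HTML heading."""
    m = ATX_HEADER.match(markdown)
    if m is None:
        # Not a valid header: hand the input back untouched.
        return markdown
    level = len(m.group(1))  # the number of hashes decides the heading level
    return '<h{0}>{1}</h{0}>'.format(level, m.group(2))


print(markdown_parser("## Header"))       # <h2>Header</h2>
print(markdown_parser("####### Header"))  # seven hashes: returned unchanged
```

Because the first capture group is limited to six hashes and must be followed by whitespace, a line with seven hashes or with no space after the hashes simply fails to match and is returned as received.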
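Unless you can't for some reason, you are better off using a Markdown library, since it is an implementation of the original library, warts and all.

You can see how the Python Markdown library implements it: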

    class HashHeaderProcessor(BlockProcessor):
        """ Process Hash Headers. """

        # Detect a header at start of any line in block
        RE = re.compile(r'(^|\n)(?P<level>#{1,6})(?P<header>.*?)#*(\n|$)')

        def test(self, parent, block):
            return bool(self.RE.search(block))

        def run(self, parent, blocks):
            block = blocks.pop(0)
            m = self.RE.search(block)
            if m:
                before = block[:m.start()]  # All lines before header
                after = block[m.end():]     # All lines after header
                if before:
                    # As the header was not the first line of the block and the
                    # lines before the header must be parsed first,
                    # recursively parse this lines as a block.
                    self.parser.parseBlocks(parent, [before])
                # Create header using named groups from RE
                h = util.etree.SubElement(parent, 'h%d' % len(m.group('level')))
                h.text = m.group('header').strip()
                if after:
                    # Insert remaining lines as first block for future parsing.
                    blocks.insert(0, after)
            else:  # pragma: no cover
                # This should never happen, but just in case...
                logger.warn("We've got a problem header: %r" % block)

matches is not the matched text but a match object; use matches.group to interact with it, cf.:

    m = re.search('(?…
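As a minimal sketch of that advice applied to the function from the question: the version below reads the matched text with group(0) instead of subscripting, which works on interpreters older than Python 3.6 as well. The name HEADER_PREFIX, the switch from search() to match() (so the hashes must start the stripped line) and the explicit check for more than six hashes are assumptions added here, not part of the original code.

```python
import re

# Hypothetical name; the pattern itself is the one used in the question.
HEADER_PREFIX = re.compile(r"#+\s")


def markdown_parser(markdown):
    stripped = markdown.strip()
    # match() anchors at the start of the string, unlike search().
    match = HEADER_PREFIX.match(stripped)
    if match is None:
        return markdown
    # group(0) is the whole matched text: the hashes plus one whitespace char.
    level = len(match.group(0)) - 1
    if level > 6:
        # More than six hashes is not a valid header.
        return markdown
    content = stripped[match.end():].strip()
    return "<h{0}>{1}</h{0}>".format(level, content)


print(markdown_parser("  # Header  "))  # <h1>Header</h1>
```

match.end() gives the index just past the hashes and the separating whitespace, so the content slice no longer depends on recomputing the prefix length by hand.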