Python regex - get items from an orgmode file
I have the following org-mode syntax:
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Technology [1/6]
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
I want to extract the items under a given heading. For example:
getitems "Hardware"
should give me:
- [ ] adapt a programmable motor to a tripod to be used for panning
If I ask for "Reading - Health", I should get:
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
I'm using the following pattern:
pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL)
The output when asking for "Reading - Technology" is:
I also tried:
pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL)
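This works for every heading except the last one.
Output when asking for "Reading - Health":
As you can see, it does not match the last line.
I'm using Python 2.7 and findall.

I'm not sure you need a regex for the whole match. I would use a regex only to match the ** line, and then return lines until you reach the next ** line.
Something like: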
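import re

# head is the heading text; my_file is an open file (or any iterable of lines)
pattern = re.compile(r"\*\* " + re.escape(head))
start = False
output = []
for line in my_file:
    if pattern.match(line):
        start = True       # found the requested heading
    elif line.startswith("**"):  # a heading, but not the one we want
        if start:
            break          # next heading reached: the section is over
    elif start:
        output.append(line)
# now `output` should have the lines you want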
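The loop above can be wrapped in a function and tried against the sample data. This is a minimal sketch, assuming the file content is available as a string; `ORG_TEXT` and `get_items_by_lines` are illustrative names, not part of the original answer, and the heading is compared with `startswith` instead of a regex.

```python
ORG_TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Technology [1/6]
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
"""


def get_items_by_lines(text, head):
    """Return the lines between the '** head' line and the next '**' line."""
    output = []
    collecting = False
    for line in text.splitlines():
        if line.startswith("** " + head):
            collecting = True       # found the requested heading
        elif line.startswith("**"):
            if collecting:
                break               # the next heading ends the section
        elif collecting:
            output.append(line)
    return output


print(get_items_by_lines(ORG_TEXT, "Reading - Health"))
```

Because the comparison is a prefix test, a shorter name such as "Reading" selects the first heading that starts with it.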
If you are sure the character * never appears in the items, you can use:
re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?")
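To try this pattern, here is a small sketch on a shortened copy of the sample. `ORG_TEXT` and `items_for` are illustrative names, and `re.escape` is added as a precaution so that the heading text is matched literally. Since `[^*]+` stops at the first asterisk, an item containing `*` would cut the block short, which is the caveat stated above.

```python
import re

ORG_TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
"""


def items_for(head, text=ORG_TEXT):
    # Heading line, the "[done/total]" counter, then everything up to the next "*".
    pattern = re.compile(r"\*\* " + re.escape(head) + r" \[\d+/\d+\]\n([^*]+)\*?")
    match = pattern.search(text)
    return match.group(1).splitlines() if match else []


print(items_for("Hardware"))
print(items_for("Reading - Health"))
```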
You can do it with:
import re
string = """
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Technology [1/6]
- [X] Introduction to Networking - Charles Severance
- [ ] A Tour of C++ - Bjarne Stroustrup
- [ ] C++ How to Program - Paul Deitel
- [X] Computer Systems - Randal Bryant
- [ ] The C programming language - Brian Kernighan
- [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
- [X] Supple Leopard - Kelly Starrett
- [X] Convict Conditioning 1 and 2
"""
def getitems(section):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    try:
        items = rx.search(string)
        return items.group('block')
    except AttributeError:  # search() returned None: no such heading
        return None

items = getitems('Reading - Technology')
print(items)
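Regular expressions are very good at matching structured data, such as lines that always have a specific format. They get really tricky when you have to match a bunch of arbitrary text between the lines you care about, which is why I usually avoid the approach you were trying. On second look, the pattern.match in my answer could just as well be line.startswith("** " + head).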
\*\* Reading - Health (.*?)(?:\*\*|$)
Thanks, it improved the whole code. And regex101.com is a great tool/website.
@daleonpz: Added a version that returns only the selected values.
^\*{2}.+[\n\r] # match the beginning of the line, followed by two stars, anything else in between and a newline
(?P<block> # open group "block"
(?: # non-capturing group
(?!^\*{2}) # a neg. lookahead, making sure no ** follows at the beginning of a line
[\s\S] # any character...
)+ # ...at least once
) # close group "block"
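The same breakdown can live inside the code by compiling the pattern with `re.VERBOSE`, which ignores unescaped whitespace and allows `#` comments. The sketch below is a variant under stated assumptions, not the answer's exact code: `SAMPLE` and `section_block` are illustrative names, the literal space after the stars is written as `[ ]` so that verbose mode keeps it, and spaces in the heading survive because `re.escape` escapes them.

```python
import re

SAMPLE = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
"""


def section_block(section, text=SAMPLE):
    rx = re.compile(
        r"""
        ^\*{2}[ ]""" + re.escape(section) + r"""   # two stars, a space, the heading text
        .+[\n\r]                 # rest of the heading line and its newline
        (?P<block>               # open group "block"
            (?:
                (?!^\*{2})       # stop before a line that starts with two stars
                [\s\S]           # any character, newlines included
            )+
        )
        """,
        re.MULTILINE | re.VERBOSE,
    )
    match = rx.search(text)
    return match.group("block") if match else None


print(section_block("Reading - Health"))
```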
def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    if match is None:
        return None  # no such heading
    items = match.group('block')
    if selected:
        # return only the text of the checked ("[X]") items;
        # leading spaces or tabs before the dash are optional
        rxi = re.compile(r'^[ \t]*- \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items

items = getitems('Reading - Health', selected=True)
print(items)