Python regex: get items from an org-mode file


I have the following org-mode syntax:

** Hardware [0/1]
 - [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6]
 - [X] Introduction to Networking - Charles Severance
 - [ ] A Tour of C++ - Bjarne Stroustrup
 - [ ] C++ How to Program - Paul Deitel
 - [X] Computer Systems - Randal Bryant
 - [ ] The C programming language - Brian Kernighan
 - [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
 - [ ] Patrick McKeown - The Oxygen Advantage
 - [X] Total Knee Health - Martin Koban
 - [X] Supple Leopard - Kelly Starrett
 - [X] Convict Conditioning 1 and 2

I want to extract the items under a given heading. For example:

 getitems "Hardware"

should give me:

 - [ ] adapt a programmable motor to a tripod to be used for panning

If I ask for "Reading - Health", I should get:

 - [ ] Patrick McKeown - The Oxygen Advantage
 - [X] Total Knee Health - Martin Koban
 - [X] Supple Leopard - Kelly Starrett
 - [X] Convict Conditioning 1 and 2

I am using the following pattern:

   pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL)

The output when asking for "Reading - Technology" is:

I also tried:

   pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL)

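This one works for every heading except the last one.

The output when asking for "Reading - Health":

As you can see, it does not match the last line.

I am using Python 2.7 and findall.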
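
A likely reason both attempts misbehave: without re.MULTILINE, `$` only matches at the end of the whole text, so the lazy group keeps growing until it gets there; and `[\*|\z]` is a character class, which always has to consume one real character and so cannot stand for "end of text" (inside a class `|` is a literal pipe, and Python spells the end-of-string anchor `\Z`). The following is a minimal sketch of one way to repair the pattern, using a lookahead for "next heading or end of text". The names `find_block` and `ORG_TEXT` are illustrative and not part of the original question, and the sample text is shortened.

```python
import re

# Shortened sample in the same format as the question (illustrative data).
ORG_TEXT = """\
** Hardware [0/1]
 - [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
 - [ ] Patrick McKeown - The Oxygen Advantage
 - [X] Total Knee Health - Martin Koban
"""


def find_block(text, head):
    """Return the lines under the heading `head`, or None if it is absent."""
    pattern = re.compile(
        # heading line, then a lazy capture that stops right before the
        # next line starting with "** " or at the very end of the text
        r"^\*\* " + re.escape(head) + r"[^\n]*\n(.*?)(?=^\*\* |\Z)",
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


print(find_block(ORG_TEXT, "Hardware"))
print(find_block(ORG_TEXT, "Reading - Health"))
```

Because the stop condition is a lookahead, the next heading is not consumed, and the last section is terminated by `\Z` instead of needing one more character after it.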

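I am not sure a regex is needed for the whole match. I would use a regex only to match the ** heading line, and then return lines until the next ** line shows up.

Something like: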

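import re

# re.escape keeps characters such as "+" or "[" in the heading literal
pattern = re.compile(r"\*\* " + re.escape(head))

start = False
output = []
for line in my_file:
    if pattern.match(line):
        # found the requested heading: collect the lines that follow it
        start = True
        continue
    elif start and line.startswith("**"):
        # a different heading after ours: the block is complete
        break

    if start:
        output.append(line)

# now `output` should have the lines you want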
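
For reference, here is the same line-by-line idea wrapped in a function so it can be run on its own. This is only a sketch: `collect_items` and `ORG_TEXT` are illustrative names, the sample text is shortened, and the function accepts any iterable of lines (an open file or `text.splitlines()`).

```python
import re

# Shortened sample in the same format as the question (illustrative data).
ORG_TEXT = """\
** Hardware [0/1]
 - [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
 - [ ] Patrick McKeown - The Oxygen Advantage
 - [X] Total Knee Health - Martin Koban
"""


def collect_items(lines, head):
    """Return the lines between the heading `head` and the next heading."""
    heading = re.compile(r"\*\* " + re.escape(head))
    collecting = False
    items = []
    for line in lines:
        if line.startswith("**"):
            if collecting:
                # a new heading after the requested one ends the block
                break
            # start collecting only if this is the requested heading
            collecting = heading.match(line) is not None
            continue
        if collecting:
            items.append(line.rstrip("\n"))
    return items


print(collect_items(ORG_TEXT.splitlines(), "Reading - Health"))
```

Note that the heading is matched as a prefix, so asking for "Reading" would also select a "Reading - Health" heading if it comes first.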

If you are sure that the character * never occurs inside the items, you can use:

re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?")
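
A small sketch of how this pattern behaves, including the caveat about asterisks. The function name and both sample strings are made up for illustration, and `re.escape` is an addition not present in the pattern above.

```python
import re


def items_without_asterisks(text, head):
    """Capture everything after the heading up to the first '*' character."""
    pattern = re.compile(
        r"\*\* " + re.escape(head) + r" \[\d+/\d+\]\n([^*]+)\*?"
    )
    found = pattern.findall(text)
    return found[0] if found else None


clean = "** Hardware [0/1]\n - [ ] adapt a motor\n** Reading [0/1]\n - [ ] A Tour of C++\n"
starred = "** Hardware [0/2]\n - [ ] read *Dune*\n - [ ] adapt a motor\n"

print(repr(items_without_asterisks(clean, "Hardware")))
# An item containing '*' cuts the capture short:
print(repr(items_without_asterisks(starred, "Hardware")))
```

The second call shows why the assumption matters: the capture ends at the first asterisk inside the item text, not at the next heading.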

You could go with:

import re

string = """
** Hardware [0/1]
 - [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6]
 - [X] Introduction to Networking - Charles Severance
 - [ ] A Tour of C++ - Bjarne Stroustrup
 - [ ] C++ How to Program - Paul Deitel
 - [X] Computer Systems - Randal Bryant
 - [ ] The C programming language - Brian Kernighan
 - [ ] Beginning Linux Programming -Matthew and Stones
** Reading - Health [3/4]
 - [ ] Patrick McKeown - The Oxygen Advantage
 - [X] Total Knee Health - Martin Koban
 - [X] Supple Leopard - Kelly Starrett
 - [X] Convict Conditioning 1 and 2  
 """

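def getitems(section):
    # match the heading line, then capture everything up to the next line that starts with "**"
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    # search() returns None when no such heading exists
    return match.group('block') if match else None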

items = getitems('Reading - Technology')
print(items)

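Regular expressions are very good at matching structured data, such as lines that always have a specific format. They get tricky when you have to match a bunch of arbitrary text between the lines you care about, which is why I usually avoid the approach you were attempting. On second look, the pattern.match in my answer could just as well be line.startswith("** " + head).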
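\*\* Reading - Health (.*?)(?:\*\*|$)

Thanks, it improved the whole code. And regex101.com is a great tool.

@daleonpz: added a version that returns only the selected values.

The pattern, broken down: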
^\*{2}.+[\n\r]       # match the beginning of the line, followed by two stars, anything else in between and a newline
(?P<block>           # open group "block"
    (?:              # non-capturing group
        (?!^\*{2})   # a neg. lookahead, making sure no ** follows at the beginning of a line
        [\s\S]       # any character...
    )+               # ...at least once
)                    # close group "block"
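
The annotated form above is close to what Python accepts with re.VERBOSE, so the breakdown can be compiled almost as written. The sketch below does that and collects every section into a dict. Two things are my own additions rather than part of the answer: the named group `title`, and writing the space after the stars as `[ ]` (verbose mode ignores bare whitespace). `HEADING_BLOCK`, `ORG_TEXT`, and `sections` are illustrative names.

```python
import re

HEADING_BLOCK = re.compile(r"""
    ^\*{2}[ ](?P<title>.+)[\n\r]   # heading line: two stars, a space, the title, a line break
    (?P<block>                     # open group "block"
        (?:                        # non-capturing group
            (?!^\*{2})             # negative lookahead: no two stars at the start of a line here
            [\s\S]                 # any character, newlines included
        )+                         # ...at least once
    )                              # close group "block"
""", re.MULTILINE | re.VERBOSE)

# Shortened sample in the same format as the question (illustrative data).
ORG_TEXT = """\
** Hardware [0/1]
 - [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
 - [X] Total Knee Health - Martin Koban
 - [X] Supple Leopard - Kelly Starrett
"""

# Map each heading line (progress cookie included) to its block of items.
sections = {m.group("title"): m.group("block") for m in HEADING_BLOCK.finditer(ORG_TEXT)}

for title, block in sections.items():
    print(title)
    print(block)
```

Since the block group needs at least one character, a heading that is immediately followed by another heading would not be matched by this sketch.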

def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    if match is None:
        # no heading with that name
        return None
    items = match.group('block')
    if selected:
        # keep only the checked entries, i.e. lines of the form " - [X] ..."
        rxi = re.compile(r'^ - \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items

items = getitems('Reading - Health', selected=True)
print(items)