How to parse a plain MEDLINE-format file in Python


I need to process a MEDLINE file with the following structure:

PMID- 1
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
TI  - Formate assay in body fluids: application in methanol poisoning.

PMID- 2
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
TI  - Delineation of the intimate details of the backbone conformation of pyridine
      nucleotide coenzymes in aqueous solution.

PMID- 21
OWN - NLM
STAT- MEDLINE
DCOM- 20121113
TI  - [Biochemical studies on camomile components/III. In vitro studies about the
      antipeptic activity of (--)-alpha-bisabolol (author's transl)].
AB  - (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which
      is not caused by an alteration of the pH-value. The proteolytic activity of
      pepsin is reduced by 50 percent through addition of bisabolol in the ratio of
      1/0.5. The antipeptic action of bisabolol only occurs in case of direct contact.
      In case of a previous contact with the substrate, the inhibiting effect is lost.

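The main task is to print only the lines that belong to the PMID, TI and AB fields. I started with the script pasted below.

Problem: I don't know why the med.records object is empty at the end of processing. Any ideas are appreciated.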

import re

class Medline:
    """ MEDLINE file structure """
    def __init__(self, in_file=None):
        """ Initialize and parse input """
        self.records = []
        if in_file:
            self.parse(in_file)

    def parse(self, in_file):
        """ Parse input file """
        self.current_tag = None
        self.current_record = None
        prog = re.compile("^(....)- (.*)")
        lines = []
        # Skip blank lines
        for line in in_file:
            line = line.rstrip()
            if line == "":
                continue
            if not line.startswith("      "):
                match = prog.match(line)
                if match:
                    tag = match.groups()[0]
                    field = match.groups()[1]
                    self.process_field(tag, field)

    def process_field(self, tag, field):
        """ Process MEDLINE file field """
        if tag == "PMID":
            self.current_record = {tag: field}

def main():
    """ Test the code """
    import pprint
    with open("medline_file.txt", "rt") as medline_file:
        med = Medline(medline_file)
        pp = pprint.PrettyPrinter()
        pp.pprint(med.records)

if __name__ == "__main__":
    main()

It's a simple oversight.

In process_field(self, tag, field) you save the tag and field in self.current_record:
self.current_record = {tag: field}

But then you never do anything with it. In main you print the records attribute:
pp.pprint(med.records)

You never add anything to it, so of course it is empty.
One fix is:
def process_field(self, tag, field):
    """ Process MEDLINE file field """
    if tag == "PMID":
        self.records.append({tag: field})

This produces the output:
[{'PMID': '1'}, {'PMID': '2'}, {'PMID': '21'}]
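
The fix above only keeps the PMID. Since the goal is to keep TI and AB as well, one possible extension is sketched below. RecordCollector and WANTED are hypothetical names used only for illustration; the idea is to append the record dict as soon as a PMID line is seen and then keep filling that same dict. Note that the regular expression in the question captures four-character tags including padding (for example "TI  "), so the tag is stripped before it is compared.

```python
class RecordCollector:
    """Hypothetical helper: groups (tag, field) pairs into one dict per PMID."""

    WANTED = {"PMID", "TI", "AB"}

    def __init__(self):
        self.records = []
        self.current_record = None

    def process_field(self, tag, field):
        # Tags are padded to four characters in the file, e.g. "TI  ".
        tag = tag.strip()
        if tag == "PMID":
            # A PMID line starts a new record. Append it immediately so it
            # cannot be lost, then keep adding fields to the same dict.
            self.current_record = {tag: field}
            self.records.append(self.current_record)
        elif tag in self.WANTED and self.current_record is not None:
            self.current_record[tag] = field


# Feed the collector the (tag, field) pairs the question's regex would yield.
collector = RecordCollector()
pairs = [
    ("PMID", "1"),
    ("OWN ", "NLM"),
    ("TI  ", "Formate assay in body fluids: application in methanol poisoning."),
]
for tag, field in pairs:
    collector.process_field(tag, field)
print(collector.records)
```

Fields such as OWN are silently ignored because they are not in WANTED; this sketch still stores only the first line of a multi-line field.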

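Also, you said the AB field is important. Keep in mind that because of this line: if not line.startswith("      "):
only the first line of AB is saved under the tag (for example: AB  - (--)-alpha-Bisabolol has a primary antipeptic action depending on dosage, which
), and all the continuation lines are filtered out.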
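
One way to keep the continuation lines is to remember the most recent tag and append each indented line to that field. The following is a minimal sketch under that assumption, not the original author's code; parse_medline, TAG_LINE and the wanted parameter are hypothetical names, and joining the pieces with a single space is a choice made here for readability.

```python
import re

# Four-character tag (possibly space-padded), then "- ", then the value.
TAG_LINE = re.compile(r"^(.{4})- (.*)")


def parse_medline(lines, wanted=("PMID", "TI", "AB")):
    """Return one dict per record, keeping only the wanted tags."""
    records = []
    current = None
    last_tag = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # blank lines only separate records
        if line.startswith("      "):
            # Continuation line: append it to the field seen last,
            # but only if that field is one we decided to keep.
            if current is not None and last_tag in current:
                current[last_tag] += " " + line.strip()
            continue
        match = TAG_LINE.match(line)
        if not match:
            continue
        tag, field = match.group(1).strip(), match.group(2)
        last_tag = tag
        if tag == "PMID":
            current = {tag: field}
            records.append(current)
        elif tag in wanted and current is not None:
            current[tag] = field
    return records


sample = """\
PMID- 2
OWN - NLM
TI  - Delineation of the intimate details of the backbone conformation of pyridine
      nucleotide coenzymes in aqueous solution.
"""

records = parse_medline(sample.splitlines())
for record in records:
    for tag in ("PMID", "TI", "AB"):
        if tag in record:
            print(f"{tag}: {record[tag]}")
```

Because a file object iterates over lines, the same function can be called as parse_medline(medline_file) inside the with block from the question.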
You could probably also do this with the Biopython package, which has a module dedicated to MEDLINE (Bio.Medline).