Python: storing multiple lines from a file into a variable using a delimiter


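I'm building a filter in Python that searches thousands of text files for specific queries. The text files are made up of several sections, and their formatting is not consistent. I want to check each of these sections against specific criteria, so for the section of the text file called "DESCRIPTION OF RECORD" I did something like this to store the string in a variable: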

with open(some_file, 'r') as r:
    for line in r:
        if "DESCRIPTION OF RECORD" in line:
            record = line

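Now, this works fine for most files, but in some files the section contains a line break, so the whole section is not stored in the variable. I'd like to know how to use a delimiter to control how many lines get stored in the variable. I would probably use the title of the next section, "CORRELATION", as the delimiter. Any ideas?

A sample of the file structure looks like this: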

CLINICAL HISTORY: Some information.
MEDICATIONS: Other information
INTRODUCTION: Some more information.
DESCRIPTION OF THE RECORD: Some information here....
another line of information
IMPRESSION: More info 
CLINICAL CORRELATION: The last bit of information

You can use the built-in re module like this:

import re

# I assume you have a list of all possible sections
sections = [
    'CLINICAL HISTORY',
    'MEDICATIONS',
    'INTRODUCTION',
    'DESCRIPTION OF THE RECORD',
    'IMPRESSION',
    'CLINICAL CORRELATION'
]

# Build a regexp that will match any of the section names
# (re.escape guards against names containing regex metacharacters)
exp = '|'.join(re.escape(section) for section in sections)

with open(some_file, 'r') as r:
    contents_of_file = r.read()

# infos is a list of what's between the section names
infos = re.split(exp, contents_of_file)
# Strip colons, spaces and newlines from both ends of each piece
infos = [info.strip('\n :') for info in infos]
print(infos)  # you don't have to print it :)

If I use your sample text instead of a file, it prints something like this:

['', 'Some information.', 'Other information', 'Some more information.', 'Some information here....\nanother line of information', 'More info', 'The last bit of information']

The first element is empty, but you can remove it simply by doing:

infos = infos[1:]
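The list above no longer says which piece belongs to which section. As a minimal sketch of one way to keep that link, assuming the same fixed list of section names: when the pattern passed to re.split contains a capturing group, the matched names are kept in the result, so names and contents can be paired up. The function name split_sections and the variable sample are illustrative, not part of the original answer.

```python
import re

# Section names taken from the sample in the question.
sections = [
    'CLINICAL HISTORY',
    'MEDICATIONS',
    'INTRODUCTION',
    'DESCRIPTION OF THE RECORD',
    'IMPRESSION',
    'CLINICAL CORRELATION',
]

def split_sections(text):
    """Return a dict mapping each section name found in text to its content."""
    # The capturing group makes re.split keep the matched names in its result.
    pattern = '(' + '|'.join(re.escape(name) for name in sections) + ')'
    parts = re.split(pattern, text)
    # parts[0] is whatever precedes the first name; after that,
    # names and contents alternate.
    names = parts[1::2]
    bodies = [body.strip('\n :') for body in parts[2::2]]
    return dict(zip(names, bodies))

# Stand-in for the contents of one file.
sample = (
    "CLINICAL HISTORY: Some information.\n"
    "MEDICATIONS: Other information\n"
    "INTRODUCTION: Some more information.\n"
    "DESCRIPTION OF THE RECORD: Some information here....\n"
    "another line of information\n"
    "IMPRESSION: More info\n"
    "CLINICAL CORRELATION: The last bit of information\n"
)

record = split_sections(sample)['DESCRIPTION OF THE RECORD']
print(record)
```

Note that this sketch would also split on a section name that happens to appear inside the body of another section, so it relies on the names being used only as headings.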

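By the way, merging the lines that process the information into a single line might be cleaner, and would certainly be more efficient (though perhaps less easy to follow).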
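The one-line version itself is not preserved in this copy of the answer. A plausible reconstruction, assuming it simply chains the split, strip and slice steps shown earlier, might look like the following; contents_of_file here is a stand-in string rather than the result of reading a real file.

```python
import re

sections = [
    'CLINICAL HISTORY',
    'MEDICATIONS',
    'INTRODUCTION',
    'DESCRIPTION OF THE RECORD',
    'IMPRESSION',
    'CLINICAL CORRELATION',
]
exp = '|'.join(sections)

# Stand-in for the text returned by r.read().
contents_of_file = (
    "CLINICAL HISTORY: Some information.\n"
    "MEDICATIONS: Other information\n"
    "INTRODUCTION: Some more information.\n"
    "DESCRIPTION OF THE RECORD: Some information here....\n"
    "another line of information\n"
    "IMPRESSION: More info\n"
    "CLINICAL CORRELATION: The last bit of information\n"
)

# Split, strip and drop the leading element in a single expression.
infos = [info.strip('\n :') for info in re.split(exp, contents_of_file)][1:]
print(infos)
```

The trailing [1:] drops whatever comes before the first section name, which is an empty string for the sample text.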

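If you don't know in advance which sections you will find, here is a version that seems to work, as long as the text is formatted like your sample: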

import itertools

text = """
CLINICAL HISTORY: Some information.
MEDICATIONS: Other information
INTRODUCTION: Some more information.
DESCRIPTION OF THE RECORD: Some information here....
another line of information
IMPRESSION: More info 
CLINICAL CORRELATION: The last bit of information 
"""

def method_tuple(s):
    # sp holds strings that end with the section names.
    sp = s.split(":")
    # Strip spurious "\n" from both ends of each string in sp, then split
    # each one once at its last "\n", which separates a section's
    # information from the name of the next section.
    # The result is a flat list alternating section names and information.
    fragments = list(itertools.chain.from_iterable(p.strip("\n").rsplit("\n", 1) for p in sp))
    # You can now build a list of 2-tuples.
    pairs = [(fragments[i*2], fragments[i*2+1]) for i in range(len(fragments)//2)]
    # Or you could build a dict instead:
    # pairs = {fragments[i*2]: fragments[i*2+1] for i in range(len(fragments)//2)}
    return pairs

print(method_tuple(text))

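Compared with Ilya's regex version, the timings are roughly equal, although over 1 billion loops on the sample text, building a dict seems to start beating both building a list of tuples and using a regexp…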
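The measurements themselves are not shown. As a rough sketch of how such a comparison could be reproduced with the standard timeit module: the function names regex_version and dict_version, and the loop count of 10,000, are assumptions made for illustration, and the numbers will vary from machine to machine.

```python
import itertools
import re
import timeit

text = """
CLINICAL HISTORY: Some information.
MEDICATIONS: Other information
INTRODUCTION: Some more information.
DESCRIPTION OF THE RECORD: Some information here....
another line of information
IMPRESSION: More info
CLINICAL CORRELATION: The last bit of information
"""

sections = [
    'CLINICAL HISTORY',
    'MEDICATIONS',
    'INTRODUCTION',
    'DESCRIPTION OF THE RECORD',
    'IMPRESSION',
    'CLINICAL CORRELATION',
]
exp = '|'.join(sections)

def regex_version(s):
    # Regex approach: split on the known section names.
    return [info.strip('\n :') for info in re.split(exp, s)][1:]

def dict_version(s):
    # Colon-splitting approach, building a dict of name -> information.
    fragments = list(itertools.chain.from_iterable(
        p.strip("\n").rsplit("\n", 1) for p in s.split(":")))
    return dict(zip(fragments[::2], fragments[1::2]))

for func in (regex_version, dict_version):
    seconds = timeit.timeit(lambda: func(text), number=10_000)
    print(f"{func.__name__}: {seconds:.4f} s for 10,000 calls")
```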

I found another possible solution, using the index of the line. I first opened the check file and stored its f.read() contents in a variable named info. Then I did this:

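with open(check_file, 'r') as f:
    info = f.read()

with open(check_file, 'r') as r:
    for line in r:
        if "DESCRIPTION" in line:
            # Offset of the section's first line within the whole text
            record_index = info.index(line)
            record = info[record_index:]
            if "IMPRESSION" in record:
                # Search from record_index so an earlier "IMPRESSION" is skipped
                impression_index = info.index("IMPRESSION", record_index)
                record = info[record_index:impression_index]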

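This approach also works well, though I don't know how efficient it is in terms of memory and speed. Rather than using with open(...) several times, it is better to store everything in a variable named info and then use that variable for all operations.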
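A minimal sketch of that single-read idea, assuming the section headings can be located with plain substring searches. The helper name extract_section is hypothetical, and unlike the loop above it starts the slice at the marker itself rather than at the beginning of the line that contains it (the two coincide when the heading starts the line, as in the sample).

```python
def extract_section(info, start_marker, end_marker):
    """Return the text from start_marker up to end_marker, or to the end."""
    start = info.find(start_marker)
    if start == -1:
        return None  # the section is not present
    # Look for the end marker only after the start marker.
    end = info.find(end_marker, start + len(start_marker))
    if end == -1:
        return info[start:]
    return info[start:end]

# In practice info would come from a single read of the file, e.g.:
# with open(check_file, 'r') as f:
#     info = f.read()
info = (
    "INTRODUCTION: Some more information.\n"
    "DESCRIPTION OF THE RECORD: Some information here....\n"
    "another line of information\n"
    "IMPRESSION: More info\n"
)

record = extract_section(info, "DESCRIPTION", "IMPRESSION")
print(record)
```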

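Could you provide an example of what a file looks like?

No problem, I just did. Thanks.

Thanks, working on an answer :)

What kind of output are you after? Do you want to collect all the sections and their descriptions (say, to build a dictionary), or just do something once a pair is found, without storing it? The answer may differ.

The first version I posted didn't work because of a typo, so I edited it. If you have trouble, try the new one.

OK, I'll give it a try! Thanks.

This only works if there are no colons inside the sections. (Can we assume there are none?) For example, try changing the first line to CLINICAL HISTORY: Was ill:Yes. and the keys will swap places with the values. But if we can assume that, it's a nice answer.

Indeed, that's what I meant by "as long as…". In fact, I wouldn't recommend it if the source text can't be trusted: a simple typo (a misplaced colon) is enough to break it. Your version is robust because it knows what to search for, which makes it better suited to untrusted sources.

Sorry, I hadn't seen that when I wrote my comment. By the way, if you use my "optimized" version (1 line instead of 3, two fewer list constructions), is it any faster?

No problem, you were right to clarify! I'll give your optimized version a go. Should I post an answer that includes all the versions we came up with, along with their timings?

I think that if you measure my optimized version (which would be nice), you should probably just edit your answer.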
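For the colon problem raised in the comments, one possible middle ground is sketched below. It assumes, purely for illustration, that every heading is written in capital letters at the start of a line and is followed by a colon; nothing in the thread confirms that this holds for all of the files. The names HEADING and parse_sections are hypothetical.

```python
import re

# Assumed heading shape: capitals and spaces at the start of a line, then ":".
HEADING = re.compile(r'^([A-Z][A-Z ]*):', re.MULTILINE)

def parse_sections(text):
    """Return a dict of heading -> content without a predefined name list."""
    # The capturing group keeps the heading names in the split result.
    parts = HEADING.split(text)
    names = parts[1::2]
    bodies = [body.strip() for body in parts[2::2]]
    return dict(zip(names, bodies))

sample = (
    "CLINICAL HISTORY: Was ill:Yes.\n"
    "DESCRIPTION OF THE RECORD: Some information here....\n"
    "another line of information\n"
    "IMPRESSION: More info\n"
)

print(parse_sections(sample))
```

A colon inside a section no longer shifts keys and values, but a continuation line that itself begins with capitals followed by a colon would still be mistaken for a heading.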