Reading multi-line records from a file into a nested dictionary in Python


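This is an exercise in nested dictionaries (or an array of dictionaries, a list of dictionaries, and so on). In C/C++ the data structure would best be described as an array of structs/classes, with each struct having several members. The challenges for me:
1). Each record begins with the string "Sample Name", followed by multiple members.
2). Each record has 6 member lines, with name and value separated by a colon ":".
3). How to read multiple lines (rather than multiple fields on the same line, which is easier to parse) into the members of a record.
4). There may be a blank line before the record delimiter.
I have put together sample input and the expected output for testing.
Example: input.txt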

Sample Name: CanNAM1_192
SNPs : 5392
MNPs : 0
Insertions : 248
Deletions : 359
Phased Genotypes : 8.8% (2349/26565)
MNP Het/Hom ratio : - (0/0)

Sample Name: CanNAM2_195
SNPs : 5107
MNPs : 0
Insertions : 224
Deletions : 351
Phased Genotypes : 8.9% (2375/26560)
MNP Het/Hom ratio : - (0/0)

Sample Name: CanNAM3_196
SNPs : 4926
MNPs : 0
Insertions : 202
Deletions : 332
Phased Genotypes : 8.0% (2138/26582)
MNP Het/Hom ratio : - (0/0)

I have been trying to parse each record by reading 7 lines at a time and then pushing/updating the record into a dictionary. I am not good at Python and would really appreciate your help.

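data = {}
with open("data.txt",'r') as fh:
    for line in fh.readlines(): #read in multiple lines
        # skip blank lines: stripping one leaves an empty string
        if len(line.strip())==0:
            continue

        if line.startswith('Sample Name'):
            # ID line: the sample name follows the colon and space
            nameLine = line.strip()
            name = nameLine.split(": ")[1]
            data[name] = {}   # empty nested dictionary for this ID
        else:
            # data line: belongs to the most recently seen ID
            splitLine = line.split(":")
            variableName = splitLine[0].strip()
            value = splitLine[1].strip()
            data[name][variableName] = value

print(data)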

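Make sure the line you are reading is not empty. If you strip all the whitespace from a blank line, you get a string of length zero. We simply check for that.

If the line starts with Sample Name, we know the ID comes after the colon and space. We can split on those characters. The ID is the second part of the split line, so we take the item at index 1.

Keep track of the current ID in a variable (I called it name). Create an empty nested dictionary entry for that ID.

If the line is not an ID line, it must be a data line associated with the most recently seen ID.

We take the line and split it on ":". The variable name is on the left, the first item, and the value is on the right, the second item. Be sure to strip the extra whitespace from both sides.

Add the variable and value pair to the dictionary entry for the ID.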
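To see the individual steps in isolation, here is a small sketch that applies them to two lines copied from the sample input. The variable names id_line and data_line are only for illustration and are not part of the answer above.

```python
# Two lines taken from the sample input in the question.
id_line = "Sample Name: CanNAM1_192\n"
data_line = "Phased Genotypes : 8.8% (2349/26565)\n"

# A blank line becomes an empty string once whitespace is stripped.
is_blank = len("\n".strip()) == 0

# ID line: split on colon + space and take the item at index 1.
name = id_line.strip().split(": ")[1]

# Data line: split on the colon, then strip both halves.
parts = data_line.split(":")
variable_name = parts[0].strip()
value = parts[1].strip()

# Store the pair under the current ID.
data = {name: {}}
data[name][variable_name] = value
print(data)
```

If a value could itself contain a colon, split(":", 1) would keep everything after the first colon together. That is a variation on the answer, not something the sample data requires.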

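After spending more time on this problem I came up with a solution, but it does not look "Pythonic", because the code that handles the first "record" (8 lines of data, including the blank line at the bottom) is redundant.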

import itertools
data = {}
with open("vcfstats.txt", 'r') as f:
    for line in f:
        if line.strip():                #Non blank line
            if line.startswith('Sample Name'):
                nameLine = line.strip()
                name = nameLine.split(": ")[1].strip()
                data[name] = {}
            else:
                splitLine = line.split(": ")
                variableName = splitLine[0].strip()
                values = splitLine[1].strip().split(" ")
                data[name][variableName] = values[0]        #Only take the first item as value
        else:
            continue

    for line in itertools.islice(f, 8):
        lines = (line.rstrip() for line in f)          # including blank lines
        lines = list(line for line in lines if line)   # skip blank lines

        for line in lines:
            if line.startswith('Sample Name'):
                nameLine = line.strip()
                name = nameLine.split(": ")[1].strip()
                data[name] = {}
            else:
                splitLine = line.split(": ")
                variableName = splitLine[0].strip()
                values = splitLine[1].strip().split(" ")
                data[name][variableName] = values[0]        #Only take the first item as value

What am I missing? Thanks a lot.
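One possible reading of the code above: the first `for line in f` loop already consumes the whole file, so by the time `itertools.islice(f, 8)` is reached there is nothing left to yield and the second block never runs. If that is right, a single pass is enough. The sketch below assumes that only the first space-separated token of each value is wanted (as in `values[0]`); `parse_stats` is a hypothetical name, not something from the original post.

```python
def parse_stats(lines):
    """Build {sample_name: {field: first_token_of_value}} in a single pass."""
    data = {}
    name = None  # most recently seen sample name
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines
        key, _, rest = line.partition(":")
        key, rest = key.strip(), rest.strip()
        if key == "Sample Name":
            name = rest
            data[name] = {}
        elif name is not None:
            # keep only the first token, e.g. "8.8%" from "8.8% (2349/26565)"
            data[name][key] = rest.split(" ")[0]
    return data


sample_lines = [
    "Sample Name: CanNAM1_192\n",
    "SNPs : 5392\n",
    "Phased Genotypes : 8.8% (2349/26565)\n",
    "MNP Het/Hom ratio : - (0/0)\n",
    "\n",
    "Sample Name: CanNAM2_195\n",
    "SNPs : 5107\n",
]
result = parse_stats(sample_lines)
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example `with open("vcfstats.txt") as f: data = parse_stats(f)`.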

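Thanks! I was overthinking this, because a record has 7 lines that have to be read together. One more question: my input file is very large (~30 GB). Is readlines() suitable?

I think you have a bigger problem than readlines(). Do you have a computer with more than 30 GB of memory? Otherwise the computer will not be able to hold the file in memory. I suggest splitting the data file into manageable chunks. I am not sure how the performance of readlines() compares with other methods, though.

That is why I called the next() function several times, hoping to read the multiple lines of data for one record.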
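On the memory question: readlines() builds a list of every line before the loop starts, whereas iterating over the file object itself reads one line at a time. Below is a minimal sketch of a generator that yields one record at a time, assuming the record layout shown in the sample input. `iter_records` is a hypothetical helper name, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def iter_records(lines):
    """Yield (name, fields) for each record, holding only one record at a time."""
    name, fields = None, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank separator line
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Sample Name":
            if name is not None:
                yield name, fields  # previous record is complete
            name, fields = value, {}
        elif name is not None:
            fields[key] = value
    if name is not None:
        yield name, fields  # last record in the input


# Stand-in for an open file; a real file object can be passed the same way.
sample = io.StringIO(
    "Sample Name: CanNAM1_192\n"
    "SNPs : 5392\n"
    "MNPs : 0\n"
    "\n"
    "Sample Name: CanNAM2_195\n"
    "SNPs : 5107\n"
    "MNPs : 0\n"
)
records = list(iter_records(sample))
print(records[0])
```

Each record can be processed or written out as it is yielded. Collecting every record into one dictionary would still grow with the number of samples, so whether that fits in memory depends on the data.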
