Python: generating a "special" dictionary structure from a TSV using only column indexes


Imagine a tab-separated file like the following:

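9606 1 GO:0002576 TAS - platelet degranulation - Process
9606 1 GO:0003674 ND - molecular_function - Function
9606 1 GO:0003674 OOO - molecular_function - Function
9606 1 GO:0005576 IDA - extracellular region - Component
9606 1 GO:0005576 TAS - extracellular region - Component
9606 1 GO:0005576 OOO - extracellular region - Component
9606 1 GO:0005615 HDA - extracellular space - Component
9606 1 GO:0008150 ND - biological_process - Process
9606 1 GO:0008150 OOO - biological_process - Process
9606 1 GO:0008150 HHH - biological_process - Process
9606 1 GO:0008150 YYY - biological_process - Process
9606 1 GO:0031012 IDA - extracellular matrix - Component
9606 1 GO:0043312 TAS - neutrophil degranulation - Process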
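I would like to create a function that takes the numbers of the columns holding the information to keep and returns a "special" dictionary. I say "special" because in my case the information is always categorical, but it can have a different number of levels, and I am tired of rewriting the logic that adds the information for each level. (Perhaps there is another way that I failed to find; if so, I apologize for my ignorance.)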

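If the specified columns are 8, 2 and 3, where column 8 is the highest-level category and column 3 the lowest, the expected dictionary is obtained with: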

import pprint

three_userinput = "8:2:3"
# Convert the 1-based column numbers to 0-based indexes
three = list(map(lambda x: int(x) - 1, three_userinput.split(":")))
DICT3 = {}
# file_handle is the already opened tab-separated file
for line in file_handle:
    # Strip the newline so the last column does not keep a trailing "\n"
    info = line.rstrip("\n").split("\t")
    if info[three[0]] in DICT3:
        if info[three[1]] in DICT3[info[three[0]]]:
            DICT3[info[three[0]]][info[three[1]]].add(info[three[2]])
        else:
            DICT3[info[three[0]]][info[three[1]]] = set([info[three[2]]])
    else:
        DICT3[info[three[0]]] = {info[three[1]]:set([info[three[2]]])}

pprint.pprint(DICT3)
Output:

{'Component': {'1': set(['GO:0005576', 'GO:0005615', 'GO:0031012'])},
 'Function': {'1': set(['GO:0003674'])},
 'Process': {'1': set(['GO:0002576', 'GO:0008150', 'GO:0043312'])}}
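Now with four columns, 8, 2, 3 and 4, where column 8 is the highest-level category and column 4 the lowest, the expected dictionary is obtained with: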

four_userinput = "8:2:3:4"
# Convert the 1-based column numbers to 0-based indexes
four = list(map(lambda x: int(x) - 1, four_userinput.split(":")))
DICT4 = {}
for line in file_handle:
    # Strip the newline so the last column does not keep a trailing "\n"
    info = line.rstrip("\n").split("\t")
    if info[four[0]] in DICT4:
        if info[four[1]] in DICT4[info[four[0]]]:
            if info[four[2]] in DICT4[info[four[0]]][info[four[1]]]:
                DICT4[info[four[0]]][info[four[1]]][info[four[2]]].add(info[four[3]])
            else:
                DICT4[info[four[0]]][info[four[1]]][info[four[2]]] = set([info[four[3]]])
        else:
            DICT4[info[four[0]]][info[four[1]]] = {info[four[2]]:set([info[four[3]]])}
    else:
        DICT4[info[four[0]]] = {info[four[1]]:{info[four[2]]:set([info[four[3]]])}}

pprint.pprint(DICT4)
Output:

{'Component': {'1': {'GO:0005576': set(['IDA', 'OOO', 'TAS']),
                     'GO:0005615': set(['HDA']),
                     'GO:0031012': set(['IDA'])}},
 'Function': {'1': {'GO:0003674': set(['ND', 'OOO'])}},
 'Process': {'1': {'GO:0002576': set(['TAS']),
                   'GO:0008150': set(['HHH', 'ND', 'OOO', 'YYY']),
                   'GO:0043312': set(['TAS'])}}}
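Now, when I face five levels of information (five columns), the code becomes almost unreadable and very tedious to write. I could create a specific function for each number of levels, but is there a way to design one function that handles any number of levels?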

If anything is unclear, please do not hesitate to ask.

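What you need is a defaultdict. It lets you update entries without first testing whether they exist, i.e. if a key is not present, a default value is added automatically. As you have multiple levels, you need a build_defaultdict(levels) function to create the nested defaultdicts recursively. Setting the values also needs to be recursive, but the logic is simpler: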

import pprint
import csv
from operator import itemgetter
from collections import defaultdict


def build_defaultdict(levels):
    return defaultdict(set) if levels <= 1 else defaultdict(lambda : build_defaultdict(levels - 1))


def set_value(d, row):
    if len(row) <= 2:
        d[row[0]].add(row[1])
    else:
        d[row[0]] = set_value(d[row[0]], row[1:])

    return d


req_cols = [7, 1, 2, 3]     # counting from col 0

data = build_defaultdict(len(req_cols) - 1)
get_cols = itemgetter(*req_cols)

with open('input.csv', 'r', newline='') as f_input:
    for row in csv.reader(f_input, delimiter='\t'):
        set_value(data, get_cols(row))

pprint.pprint(data)
print(data['Component']['1']['GO:0005576'])        
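
For a quick check without an input file, the same idea can be tried on a few in-memory rows. This is only a sketch: the rows sample and the to_plain_dict helper are assumptions added for illustration and are not part of the answer above. The helper converts the nested defaultdicts to ordinary dicts so the printed result is easier to read.

```python
from collections import defaultdict
from operator import itemgetter


def build_defaultdict(levels):
    # One defaultdict per key level; the innermost level collects values in a set.
    if levels <= 1:
        return defaultdict(set)
    return defaultdict(lambda: build_defaultdict(levels - 1))


def set_value(d, row):
    # Walk down one level per key; the last two items are (innermost key, value).
    if len(row) <= 2:
        d[row[0]].add(row[1])
    else:
        set_value(d[row[0]], row[1:])
    return d


def to_plain_dict(d):
    # Hypothetical helper: recursively turn defaultdicts into regular dicts.
    if isinstance(d, defaultdict):
        return {key: to_plain_dict(value) for key, value in d.items()}
    return d


# Hypothetical sample rows using the 8-column layout from the question.
rows = [
    ["9606", "1", "GO:0003674", "ND", "-", "molecular_function", "-", "Function"],
    ["9606", "1", "GO:0003674", "OOO", "-", "molecular_function", "-", "Function"],
    ["9606", "1", "GO:0005615", "HDA", "-", "extracellular space", "-", "Component"],
]

req_cols = [7, 1, 2, 3]  # counting from column 0
data = build_defaultdict(len(req_cols) - 1)
get_cols = itemgetter(*req_cols)

for row in rows:
    set_value(data, get_cols(row))

result = to_plain_dict(data)
print(result)
```

Converting at the end keeps the convenient auto-creating behaviour while the structure is being filled, and gives back plain dicts that no longer create keys silently on lookup.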

You can define a recursive function to do this:

def update_nested_dict(d, vars):
    if len(vars) > 2:
        try:
            d[vars[0]] = update_nested_dict(d[vars[0]], vars[1:])
        except KeyError:
            d[vars[0]] = update_nested_dict({}, vars[1:])
    else:
        try:
            d[vars[0]] = d[vars[0]].union([vars[1]])
        except KeyError:
            d[vars[0]] = set([vars[1]])
    return d
Keeping as much of your code logic and as many of your variable names as possible:

>>> userinput = "8:2:3:4"
>>> cols = map(lambda x: int(x) - 1, userinput.split(":"))
>>> 
>>> DICT = {}
>>> 
>>> for line in file_handle:
...     info = line.replace("\n", "").split("\t")
...     names = [info[c] for c in cols]
...     _ = update_nested_dict(DICT, names)
...
>>> for k, v in DICT.iteritems():
...     print k, v
...
Process {'1': {'GO:0002576': set(['TAS']), 'GO:0008150': set(['YYY', 'OOO', 'HHH', 'ND']), 'GO:0043312': set(['TAS'])}}
Function {'1': {'GO:0003674': set(['OOO', 'ND'])}}
Component {'1': {'GO:0005576': set(['OOO', 'IDA', 'TAS']), 'GO:0005615': set(['HDA']), 'GO:0031012': set(['IDA'])}}
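
The recursive update can also be written as a loop with dict.setdefault, which avoids the try/except blocks. The following is a sketch under stated assumptions rather than part of the answer: nest_columns and its parameters are hypothetical names, the column specification uses the same 1-based, colon-separated form as the question, and at least two columns must be given. It is written for Python 3.

```python
def nest_columns(lines, column_spec, sep="\t"):
    """Group delimited lines into nested dicts keyed by the given columns.

    Hypothetical helper: column_spec is a 1-based, colon-separated string such
    as "8:2:3:4"; values of the last column listed are collected into sets.
    """
    cols = [int(c) - 1 for c in column_spec.split(":")]
    if len(cols) < 2:
        raise ValueError("need at least two columns")
    result = {}
    for line in lines:
        info = line.rstrip("\n").split(sep)
        names = [info[c] for c in cols]
        node = result
        # Descend through every key level except the innermost one.
        for key in names[:-2]:
            node = node.setdefault(key, {})
        # The innermost key maps to a set of values from the last column.
        node.setdefault(names[-2], set()).add(names[-1])
    return result


# Hypothetical sample lines using the 8-column layout from the question.
sample = [
    "9606\t1\tGO:0002576\tTAS\t-\tplatelet degranulation\t-\tProcess\n",
    "9606\t1\tGO:0008150\tND\t-\tbiological_process\t-\tProcess\n",
    "9606\t1\tGO:0008150\tOOO\t-\tbiological_process\t-\tProcess\n",
]

three_levels = nest_columns(sample, "8:2:3")
four_levels = nest_columns(sample, "8:2:3:4")
print(three_levels)
print(four_levels)
```

Because the loop only depends on the length of the column list, the same function covers three, four or five levels without any per-level code.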

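What kind of edit is this... It adds no value, and what annoys me most is: why was my "thanks" to the community removed?
This is faster, and although it solves the tedious part of the logic, the structure of the dictionary still has to be predefined for each level at the start, and so does the step that inserts the information. I need to find a way to rewrite both parts so that they work for any number of levels.
The defaultdict approach can also be recursive. I have updated the answer.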