Python: converting a TSV file to a DataFrame

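Since last week I have been stuck on a problem with my TSV file, which I want to modify and convert into a pandas DataFrame.

My file looks like this: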

{'NC_011745.1_islands.csv': [['PAI 1 EaaA, EibA : 3.1'],
                             ['PAI 2 EaaA : 7.75'],
                             ['PAI 3 Capsule : 4.428571428571429'],
                             ['PAI 4 EaaA : 7.75'],
                             ['PAI 5 ipaH : 7.75'],
                             ['PAI 6 IreA, IrgA homolog adhesin (Iha) : '
                              '0.96875'],
                             ['PAI 7 IrgA homolog adhesin (Iha), Aerobactin : '
                              '0.8157894736842105'],
                             ['PAI 8 MsbB2, VirK : 2.8181818181818183'],
                             ['PAI 9 Antigen 43, AIDA-I type : '
                              '1.3478260869565217']],
 'NC_017632_islands.csv': [['PAI 1 Capsule : 15.857142857142858'],
                           ['PAI 2 AAI/SCI-II, direct heme uptake system, '
                            'Colibactin, Colibactin : 1.819672131147541'],
                           ['PAI 3 F9-like fimbriae, Type 1 fimbriae : '
                            '3.3636363636363638'],
                           ['PAI 4 Ferrous iron transport : 5.045454545454546'],
                           ['PAI 5 Cah, AIDA-I type, Salmochelin, S fimbriae : '
                            '2.707317073170732'],
                           ['PAI 6 ECP, Tsh : 13.875'],
                           ['PAI 7 ACE/AEC T6SS : 9.25'],
                           ['PAI 8 Tia/Hek, P fimbriae, F17-like fimbriae, '
                            'AAI/SCI-II, CNF-1, Alpha-hemolysin, '
                            'hemagglutinin-like adhesin : 1.088235294117647']],
 'NC_017646_islands.csv': [['PAI 1 Allantion utilization : 5.285714285714286'],
                           ['PAI 2 direct heme uptake system : 4.44'],
                           ['PAI 3 ipaH : 27.75'],
                           ['PAI 4 P fimbriae, Aerobactin, Sat, IrgA homolog '
                            'adhesin (Iha), K1 capsule, K1 capsule, T2SS : '
                            '1.3058823529411765'],
                           ['PAI 5 P fimbriae, Tia/Hek : 5.842105263157895'],
                           ['PAI 6 VirK, MsbB2 : 10.090909090909092']]}

I want to modify it and export it as a pandas DataFrame like this:

\             EaaA, EibA   EaaA   Capsule    ipaH    IreA, IrgA homolog adhesin (Iha)  ...
NC_011745.1     3.1        7.75    4.4285..  7.75                0.96875
NC_017632        NA         NA     15.8574   NA                  NA

My main problem is getting it into a DataFrame. I tried:

df = pd.DataFrame([dict]).T
df.to_tsv()

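But it says this function does not work with tsv, only with csv.

You can't do this with pandas directly. pandas is good, but it isn't magic. The data needs a fair amount of manipulation before it is in a shape that yields the DataFrame you want. Try something like this: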

import pandas as pd

_dict = {'NC_011745.1_islands.csv': [['PAI 1 EaaA, EibA : 3.1'],
                             ['PAI 2 EaaA : 7.75'],
                             ['PAI 3 Capsule : 4.428571428571429'],
                             ['PAI 4 EaaA : 7.75'],
                             ['PAI 5 ipaH : 7.75'],
                             ['PAI 6 IreA, IrgA homolog adhesin (Iha) : '
                              '0.96875'],
                             ['PAI 7 IrgA homolog adhesin (Iha), Aerobactin : '
                              '0.8157894736842105'],
                             ['PAI 8 MsbB2, VirK : 2.8181818181818183'],
                             ['PAI 9 Antigen 43, AIDA-I type : '
                              '1.3478260869565217']],
 'NC_017632_islands.csv': [['PAI 1 Capsule : 15.857142857142858'],
                           ['PAI 2 AAI/SCI-II, direct heme uptake system, '
                            'Colibactin, Colibactin : 1.819672131147541'],
                           ['PAI 3 F9-like fimbriae, Type 1 fimbriae : '
                            '3.3636363636363638'],
                           ['PAI 4 Ferrous iron transport : 5.045454545454546'],
                           ['PAI 5 Cah, AIDA-I type, Salmochelin, S fimbriae : '
                            '2.707317073170732'],
                           ['PAI 6 ECP, Tsh : 13.875'],
                           ['PAI 7 ACE/AEC T6SS : 9.25'],
                           ['PAI 8 Tia/Hek, P fimbriae, F17-like fimbriae, '
                            'AAI/SCI-II, CNF-1, Alpha-hemolysin, '
                            'hemagglutinin-like adhesin : 1.088235294117647']],
 'NC_017646_islands.csv': [['PAI 1 Allantion utilization : 5.285714285714286'],
                           ['PAI 2 direct heme uptake system : 4.44'],
                           ['PAI 3 ipaH : 27.75'],
                           ['PAI 4 P fimbriae, Aerobactin, Sat, IrgA homolog '
                            'adhesin (Iha), K1 capsule, K1 capsule, T2SS : '
                            '1.3058823529411765'],
                           ['PAI 5 P fimbriae, Tia/Hek : 5.842105263157895'],
                           ['PAI 6 VirK, MsbB2 : 10.090909090909092']]}


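f = {}
for key, a in _dict.items():
    e = {}  # column name -> value for one file
    for b in a:          # b is a one-element list
        for c in b:      # c is a string such as "PAI 1 EaaA, EibA : 3.1"
            d = c.split(" : ")
            # Drop the "PAI " prefix, then the island number and the space
            # after it. This assumes a single-digit island number.
            d[0] = d[0].replace("PAI ", "")[2:]
            d = {d[0]: d[1]}
            # Merge d into e. A repeated name (e.g. "EaaA") overwrites the
            # earlier value.
            e = {**e, **d}
    f[key] = e

# One row per file, one column per name. The values are still strings.
df = pd.DataFrame.from_dict(f, 'index')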
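
The `{**e, **d}` expression in the loop above is dict unpacking inside a dict literal (available from Python 3.5). Here is a minimal illustration, with sample values chosen only for the demonstration:

```python
# Sample values, only to illustrate the syntax.
e = {"EaaA, EibA": "3.1", "EaaA": "7.75"}
d = {"Capsule": "4.428571428571429"}

# ** unpacks each dict's key/value pairs into the new dict literal:
# first everything from e, then everything from d.
merged = {**e, **d}
print(merged)

# If a key occurs in both dicts, the right-most one wins.
overridden = {**e, **{"EaaA": "0.0"}}
print(overridden["EaaA"])

# In-place alternative that leaves e with the same contents.
e.update(d)
print(e == merged)
```

Since `d` holds a single pair in the loop, `e[name] = value` would arguably be the simpler spelling there.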

You will need to work out a more robust way of parsing the strings (possibly a regular expression), but this should get you started.
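
One possible shape for such a regex-based parser is sketched below. It assumes every entry looks like `PAI <number> <names> : <score>`. The pattern, the `parse_island` helper, and the `PAI 12` sample entry are illustrative and not part of the original answer.

```python
import re

# Assumed line shape: "PAI <number> <names> : <score>".
PAI_LINE = re.compile(r"^PAI\s+(\d+)\s+(.+?)\s+:\s+(-?\d+(?:\.\d+)?)$")

def parse_island(text):
    """Return (number, names, score); raise ValueError on an unexpected line."""
    m = PAI_LINE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognised line: {text!r}")
    number, names, score = m.groups()
    return int(number), names, float(score)

# Short excerpt of the question's data; "PAI 12" is a made-up entry that
# shows a two-digit island number being handled.
sample = {
    "NC_011745.1_islands.csv": [["PAI 1 EaaA, EibA : 3.1"], ["PAI 12 ipaH : 7.75"]],
    "NC_017632_islands.csv": [["PAI 1 Capsule : 15.857142857142858"]],
}

table = {}
for filename, rows in sample.items():
    genome = filename.replace("_islands.csv", "")  # e.g. "NC_011745.1"
    table[genome] = {}
    for (text,) in rows:  # each row is a one-element list
        _, names, score = parse_island(text)
        table[genome][names] = score  # a repeated name overwrites the earlier score

print(table)
```

The resulting nested dict can be passed to `pd.DataFrame.from_dict(table, orient='index')` in the same way as `f` above, with float values and the file suffix already stripped from the row labels.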

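Thanks for your answer, but could you explain what {**e, **d} means?

See this question for a good explanation:

The loop-based answer from @bm13563, which uses a dictionary, has already been accepted, so I will respond with a pandas approach.

Since the data is already in list form, create a DataFrame from it. Strip the brackets and split the column on " : ". Concatenate the per-file frames vertically. Finally, group by file name and reshape.
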
import pandas as pd

lst_a = [['PAI 1 EaaA, EibA : 3.1'],['PAI 2 EaaA : 7.75'],['PAI 3 Capsule : 4.428571428571429'],['PAI 4 EaaA : 7.75'],['PAI 5 ipaH : 7.75'],['PAI 6 IreA, IrgA homolog adhesin (Iha) : ' '0.96875'],['PAI 7 IrgA homolog adhesin (Iha), Aerobactin : ' '0.8157894736842105'],['PAI 8 MsbB2, VirK : 2.8181818181818183'],['PAI 9 Antigen 43, AIDA-I type : ' '1.3478260869565217']]
lst_b = [['PAI 1 Capsule : 15.857142857142858'],['PAI 2 AAI/SCI-II, direct heme uptake system, Colibactin, Colibactin : 1.819672131147541'],['PAI 3 F9-like fimbriae, Type 1 fimbriae : 3.3636363636363638'],['PAI 4 Ferrous iron transport : 5.045454545454546'],['PAI 5 Cah, AIDA-I type, Salmochelin, S fimbriae : 2.707317073170732'],['PAI 6 ECP, Tsh : 13.875'],['PAI 7 ACE/AEC T6SS : 9.25'],['PAI 8 Tia/Hek, P fimbriae, F17-like fimbriae, AAI/SCI-II, CNF-1, Alpha-hemolysin, hemagglutinin-like adhesin : 1.088235294117647']]
lst_c = [['PAI 1 Allantion utilization : 5.285714285714286'],['PAI 2 direct heme uptake system : 4.44'],['PAI 3 ipaH : 27.75'],['PAI 4 P fimbriae, Aerobactin, Sat, IrgA homolog adhesin (Iha), K1 capsule, K1 capsule, T2SS : 1.3058823529411765'],['PAI 5 P fimbriae, Tia/Hek : 5.842105263157895'],['PAI 6 VirK, MsbB2 : 10.090909090909092']]

filenames = ['NC_011745.1', 'NC_017632', 'NC_017646']
d_lists = [lst_a, lst_b, lst_c]

frames = []
for name, rows in zip(filenames, d_lists):
    df = pd.DataFrame({name: rows})
    # Each cell is a one-element list; as a string it reads "['PAI 1 ... : 3.1']".
    df = df.astype(str)
    # Strip the leading "['" and the trailing "']".
    df[name] = df[name].str.replace(r"^\['|'\]$", "", regex=True)
    df = df[name].str.split(' : ', expand=True)
    df.columns = ['col_name', 'value']
    # Drop the 6-character "PAI n " prefix (assumes a single-digit n).
    df['col_name'] = df['col_name'].apply(lambda x: x[6:])
    df['file_name'] = name
    frames.append(df)

# DataFrame.append was removed in pandas 2.0; concatenate the frames instead.
all_df = pd.concat(frames, ignore_index=True)
all_df['value'] = all_df['value'].astype('float')
# Repeated names within a file are summed (e.g. EaaA: 7.75 + 7.75 = 15.5).
all_df.groupby(['file_name', 'col_name'])['value'].sum().unstack()

col_name    AAI/SCI-II, direct heme uptake system, Colibactin, Colibactin   ACE/AEC T6SS    Allantion utilization   Antigen 43, AIDA-I type Cah, AIDA-I type, Salmochelin, S fimbriae   Capsule ECP, Tsh    EaaA    EaaA, EibA  F9-like fimbriae, Type 1 fimbriae   Ferrous iron transport  IreA, IrgA homolog adhesin (Iha)    IrgA homolog adhesin (Iha), Aerobactin  MsbB2, VirK P fimbriae, Aerobactin, Sat, IrgA homolog adhesin (Iha), K1 capsule, K1 capsule, T2SS   P fimbriae, Tia/Hek Tia/Hek, P fimbriae, F17-like fimbriae, AAI/SCI-II, CNF-1, Alpha-hemolysin, hemagglutinin-like adhesin  VirK, MsbB2 direct heme uptake system   ipaH
file_name                                                                               
NC_011745.1 NaN NaN NaN 1.347826    NaN 4.428571    NaN 15.5    3.1 NaN NaN 0.96875 0.815789    2.818182    NaN NaN NaN NaN NaN 7.75
NC_017632   1.819672    9.25    NaN NaN 2.707317    15.857143   13.875  NaN NaN 3.363636    5.045455    NaN NaN NaN NaN NaN 1.088235    NaN NaN NaN
NC_017646   NaN NaN 5.285714    NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.305882    5.842105    NaN 10.090909   4.44    27.75
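
On the export error from the question: `DataFrame` has no `to_tsv()` method, but `to_csv()` accepts a `sep` argument, so a tab separator gives TSV output. Here is a small sketch on a stand-in frame. The two columns are taken from the question's data, while the `NA` placeholder and the file name in the final comment are assumptions.

```python
import pandas as pd

# Small stand-in frame; the values are copied from the question's sample data.
df = pd.DataFrame(
    {"EaaA, EibA": [3.1, None], "Capsule": [4.428571428571429, 15.857142857142858]},
    index=["NC_011745.1", "NC_017632"],
)

# to_csv() with a tab separator writes TSV. With no path given, it returns
# the text instead of writing a file. na_rep sets how missing values appear.
tsv_text = df.to_csv(sep="\t", na_rep="NA")
print(tsv_text)

# To write a file instead (hypothetical file name):
# df.to_csv("islands.tsv", sep="\t", na_rep="NA")
```

Because the separator is a tab, the commas inside the column names do not need quoting.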