Python: create a dictionary by splitting a large DataFrame into multiple DataFrames


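I have a large, messy CSV file with many NaN values, which I read into a DataFrame with pd.read_csv(file, names=range(int)). I want to split this data into multiple DataFrames and store them in a dictionary under given keys. I have prepared a simple example to explain my problem.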

Sample raw data: my real data looks like the data below, but has more columns and rows.

import pandas as pd
import numpy as np
df = pd.DataFrame(columns=([1,2,3,4]))
df.loc[0,:] =  ['Home -AA',np.nan,np.nan,np.nan]
df.loc[1,:] =  ['place/time','value1','value2','value3']
df.loc[2,:] = ['Home time1',1, 2, 3]
df.loc[3,:] = ['Home time2',4, 5, 6]
df.loc[4,:] = ['Home time3',7, 8, 9]
df.loc[5,:] = ['sum',11,np.nan , np.nan] 
df.loc[6,:] = ['agg',12,np.nan , np.nan] 
df.loc[7,:] = ['max',6,np.nan , np.nan] 
df.loc[8,:] = ['min',8,np.nan , np.nan] 
df.loc[9,:] = ['med',1,np.nan , np.nan] 
df.loc[10,:] =  ['Home -BB',np.nan,np.nan,np.nan]
df.loc[11,:] =  ['place/time','value1','value2','value3']
df.loc[12,:] = ['Home time1',11, 12, 13]
df.loc[13,:] = ['Home time2',14, 15, 16]
df.loc[14,:] = ['Home time3',17, 18, 19]
df.loc[15,:] = ['sum',101,np.nan , np.nan] 
df.loc[16,:] = ['agg',122,np.nan , np.nan] 
df.loc[17,:] = ['max',62,np.nan , np.nan] 
df.loc[18,:] = ['min',83,np.nan , np.nan] 
df.loc[19,:] = ['med',12,np.nan , np.nan] 
df.loc[20,:] =  ['Home -CC',np.nan,np.nan,np.nan]
df.loc[21,:] =  ['place/time','value1','value2','value3']
df.loc[22,:] =  ['Home -DD',np.nan,np.nan,np.nan]
df.loc[23,:] =  ['place/time','value1','value2','value3']
df.loc[24,:] =  ['Home -EE',np.nan,np.nan,np.nan]
df.loc[25,:] =  ['place/time','value1','value2','value3']
df.loc[26,:] =  ['Home -FF',np.nan,np.nan,np.nan]
df.loc[27,:] =  ['place/time','value1','value2','value3']
df.loc[28,:] = ['Home time1',211, 212, 213]
df.loc[29,:] = ['Home time1',212, 213, 214]
df.loc[30,:] = ['sum',115,np.nan , np.nan] 
df.loc[31,:] = ['agg',124,np.nan , np.nan] 
df.loc[32,:] = ['max',65,np.nan , np.nan] 
df.loc[33,:] = ['min',85,np.nan , np.nan] 
df.loc[34,:] = ['med',16,np.nan , np.nan] 
Desired result: I want to split this DataFrame into multiple DataFrames, keyed by the Home name, and store them in a dictionary dict1. (Example result)

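I wrote the code with a for loop, but I cannot split all the DataFrames correctly. If I do not break out of the loop, I get an error (list index out of range). Could you help me get a result like the one described above?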

Code I prepared:


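You can do this by looping over the whole DataFrame and starting a new, smaller DataFrame at each separator row. It is brute force, but it works.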

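rows_by_key = {}
key = None
for _, row in df.iterrows():
    if isinstance(row[1], str) and row[1].startswith("Home -"):
        # A separator row starts a new block; its label becomes the key.
        key = row[1]
        rows_by_key[key] = []
    elif key is not None:
        rows_by_key[key].append(row)
# DataFrame.append was removed in pandas 2.0, so build each frame once from its rows.
results = {k: pd.DataFrame(rows, columns=df.columns) for k, rows in rows_by_key.items()}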
Output:

In [9]: results
Out[9]:
{'Home -AA':             1       2       3       4
 1  place/time  value1  value2  value3
 2  Home time1       1       2       3
 3  Home time2       4       5       6
 4  Home time3       7       8       9
 5         sum      11     NaN     NaN
 6         agg      12     NaN     NaN
 7         max       6     NaN     NaN
 8         min       8     NaN     NaN
 9         med       1     NaN     NaN,
 'Home -BB':              1       2       3       4
 11  place/time  value1  value2  value3
 12  Home time1      11      12      13
 13  Home time2      14      15      16
 14  Home time3      17      18      19
 15         sum     101     NaN     NaN
 16         agg     122     NaN     NaN
 17         max      62     NaN     NaN
 18         min      83     NaN     NaN
 19         med      12     NaN     NaN,
 'Home -CC':              1       2       3       4
 21  place/time  value1  value2  value3,
 'Home -DD':              1       2       3       4
 23  place/time  value1  value2  value3,
 'Home -EE':              1       2       3       4
 25  place/time  value1  value2  value3,
 'Home -FF':              1       2       3       4
 27  place/time  value1  value2  value3
 28  Home time1     211     212     213
 29  Home time1     212     213     214
 30         sum     115     NaN     NaN
 31         agg     124     NaN     NaN
 32         max      65     NaN     NaN
 33         min      85     NaN     NaN
 34         med      16     NaN     NaN}

You get the list index out of range error because y in the loop uses the (i+1)-th element of the list rawname. So you should loop only up to len(rawname)-1, like this:

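test = {}
# rawname is assumed to be the list of 'Home -...' labels from the question's code.
for i in range(len(rawname)-1):
    x = df[df[1]==rawname[i]].index.values      # first row of this block
    y = df[df[1]==rawname[i+1]].index.values    # first row of the next block
    df_1 = df.iloc[x[0]:y[0], :]
    test[rawname[i]] = df_1
# The last name has no successor, so its block runs to the end of the frame.
last = df[df[1]==rawname[-1]].index.values
test[rawname[-1]] = df.iloc[last[0]:, :]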
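
Since rawname is not defined anywhere in this thread, here is a minimal self-contained sketch of the same boundary-slicing idea. It assumes rawname is the list of 'Home -' labels in column 1, and uses a tiny made-up frame in place of the real data. Appending len(df) as a final boundary is one way to keep the last block.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the question's frame (hypothetical data, same layout).
df = pd.DataFrame(
    [
        ["Home -AA", np.nan, np.nan],
        ["place/time", "value1", "value2"],
        ["Home time1", 1, 2],
        ["Home -BB", np.nan, np.nan],
        ["place/time", "value1", "value2"],
    ],
    columns=[1, 2, 3],
)

# Assumption: rawname is the list of separator labels found in column 1.
is_header = df[1].astype(str).str.startswith("Home -")
rawname = df.loc[is_header, 1].tolist()

# Positions where each block starts, plus len(df) as a final boundary
# so that the last block is not lost.
starts = np.flatnonzero(is_header.to_numpy())
bounds = list(starts) + [len(df)]

# Slice each block from its own start up to the next start.
test = {
    name: df.iloc[bounds[i]:bounds[i + 1], :]
    for i, name in enumerate(rawname)
}
```

Each value in test still includes its own 'Home -' separator row as the first row, as in the loop above.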

You can simply use groupby with cumsum:

result = {}

for _, i in df.groupby(df[1].str.startswith("Home -").cumsum()):
    name, d = i[1].iat[0], i.iloc[1:]
    result[name] = d[~d[1].isin(["sum","agg","max","min","med"])]

print (result)

{'Home -AA':             1       2       3       4
1  place/time  value1  value2  value3
2  Home time1       1       2       3
3  Home time2       4       5       6
4  Home time3       7       8       9, 
'Home -BB':              1       2       3       4
11  place/time  value1  value2  value3
12  Home time1      11      12      13
13  Home time2      14      15      16
14  Home time3      17      18      19, 
'Home -CC':              1       2       3       4
21  place/time  value1  value2  value3, 
'Home -DD':              1       2       3       4
23  place/time  value1  value2  value3, 
'Home -EE':              1       2       3       4
25  place/time  value1  value2  value3, 
'Home -FF':              1       2       3       4
27  place/time  value1  value2  value3
28  Home time1     211     212     213
29  Home time1     212     213     214}
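
To see why the groupby key works, it may help to look at the cumulative sum of the boolean mask on its own. The sketch below uses a short hypothetical Series rather than the full frame: every separator row bumps the running total by one, so all rows of a block share one integer id.

```python
import pandas as pd

# Hypothetical labels mimicking column 1 of the frame.
labels = pd.Series(["Home -AA", "place/time", "Home time1",
                    "Home -BB", "place/time", "sum"])

# True on separator rows, False elsewhere.
is_header = labels.str.startswith("Home -")

# The running sum increases by one at every separator row, so each block
# gets its own integer id that groupby can use as a key.
group_id = is_header.cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2, 2]
```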