Python: create a dictionary of DataFrames by repeatedly splitting one large DataFrame
I have a large, messy CSV file with many NaN values. I read it into a DataFrame with pd.read_csv(file, names=range(int)). I want to split this data into multiple DataFrames and store them in a dictionary under given keys. I have prepared a simple example to explain my problem.
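For context, here is a minimal sketch of what reading such a ragged file with `names=range(n)` might look like. The file content and the column count of 4 are made up for illustration; the real file is not shown in the question. Note that `range(4)` produces column labels 0 to 3, whereas the example below uses labels 1 to 4.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: rows have different field counts.
text = "Home -AA\nplace/time,value1,value2,value3\nHome time1,1,2,3\n"

# names=range(4) fixes the number of columns up front;
# rows with fewer fields are padded with NaN on the right.
raw = pd.read_csv(io.StringIO(text), names=range(4))
print(raw)
```

Passing explicit `names` lets the parser accept short rows, which is presumably why the separator rows end up with NaN in every column but the first.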
Example raw data: my data looks like the data below, but with more columns and rows.
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=([1,2,3,4]))
df.loc[0,:] = ['Home -AA',np.nan,np.nan,np.nan]
df.loc[1,:] = ['place/time','value1','value2','value3']
df.loc[2,:] = ['Home time1',1, 2, 3]
df.loc[3,:] = ['Home time2',4, 5, 6]
df.loc[4,:] = ['Home time3',7, 8, 9]
df.loc[5,:] = ['sum',11,np.nan , np.nan]
df.loc[6,:] = ['agg',12,np.nan , np.nan]
df.loc[7,:] = ['max',6,np.nan , np.nan]
df.loc[8,:] = ['min',8,np.nan , np.nan]
df.loc[9,:] = ['med',1,np.nan , np.nan]
df.loc[10,:] = ['Home -BB',np.nan,np.nan,np.nan]
df.loc[11,:] = ['place/time','value1','value2','value3']
df.loc[12,:] = ['Home time1',11, 12, 13]
df.loc[13,:] = ['Home time2',14, 15, 16]
df.loc[14,:] = ['Home time3',17, 18, 19]
df.loc[15,:] = ['sum',101,np.nan , np.nan]
df.loc[16,:] = ['agg',122,np.nan , np.nan]
df.loc[17,:] = ['max',62,np.nan , np.nan]
df.loc[18,:] = ['min',83,np.nan , np.nan]
df.loc[19,:] = ['med',12,np.nan , np.nan]
df.loc[20,:] = ['Home -CC',np.nan,np.nan,np.nan]
df.loc[21,:] = ['place/time','value1','value2','value3']
df.loc[22,:] = ['Home -DD',np.nan,np.nan,np.nan]
df.loc[23,:] = ['place/time','value1','value2','value3']
df.loc[24,:] = ['Home -EE',np.nan,np.nan,np.nan]
df.loc[25,:] = ['place/time','value1','value2','value3']
df.loc[26,:] = ['Home -FF',np.nan,np.nan,np.nan]
df.loc[27,:] = ['place/time','value1','value2','value3']
df.loc[28,:] = ['Home time1',211, 212, 213]
df.loc[29,:] = ['Home time1',212, 213, 214]
df.loc[30,:] = ['sum',115,np.nan , np.nan]
df.loc[31,:] = ['agg',124,np.nan , np.nan]
df.loc[32,:] = ['max',65,np.nan , np.nan]
df.loc[33,:] = ['min',85,np.nan , np.nan]
df.loc[34,:] = ['med',16,np.nan , np.nan]
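Desired result: I want to convert this DataFrame into multiple DataFrames, keyed by the "Home" name and stored in a dictionary dict1. (Example result)
I wrote code using a for loop, but I could not split all the DataFrames correctly. If I do not break out of the loop, I get an error (list index out of range). Can you help me get a result like the one described above?
Prepared code idea: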
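You can do this by looping over the whole DataFrame and starting a new, smaller DataFrame at each separator row. It is brute force, but it works.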
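rows = {}  # key -> list of rows that belong to that block
key = None
for _, row in df.iterrows():
    if "Home -" in str(row[1]):
        # A separator row starts a new block named after its first cell
        key = row[1]
        rows[key] = []
    else:
        rows[key].append(row)
# Build each DataFrame once at the end; DataFrame.append was removed in pandas 2.0
results = {k: pd.DataFrame(v, columns=df.columns) for k, v in rows.items()}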
Output:
In [9]: results
Out[9]:
{'Home -AA': 1 2 3 4
1 place/time value1 value2 value3
2 Home time1 1 2 3
3 Home time2 4 5 6
4 Home time3 7 8 9
5 sum 11 NaN NaN
6 agg 12 NaN NaN
7 max 6 NaN NaN
8 min 8 NaN NaN
9 med 1 NaN NaN,
'Home -BB': 1 2 3 4
11 place/time value1 value2 value3
12 Home time1 11 12 13
13 Home time2 14 15 16
14 Home time3 17 18 19
15 sum 101 NaN NaN
16 agg 122 NaN NaN
17 max 62 NaN NaN
18 min 83 NaN NaN
19 med 12 NaN NaN,
'Home -CC': 1 2 3 4
21 place/time value1 value2 value3,
'Home -DD': 1 2 3 4
23 place/time value1 value2 value3,
'Home -EE': 1 2 3 4
25 place/time value1 value2 value3,
'Home -FF': 1 2 3 4
27 place/time value1 value2 value3
28 Home time1 211 212 213
29 Home time1 212 213 214
30 sum 115 NaN NaN
31 agg 124 NaN NaN
32 max 65 NaN NaN
33 min 85 NaN NaN
34 med 16 NaN NaN}
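The row-by-row approach can also be wrapped in a small function so that it runs on its own. This is only a sketch: `split_on_separator`, its `prefix` and `col` parameters, and the shortened `sample` frame are hypothetical names introduced here, and it matches separators with `startswith` rather than the substring test used above.

```python
import numpy as np
import pandas as pd

def split_on_separator(frame, prefix="Home -", col=1):
    """Hypothetical helper: split `frame` at rows whose `col` cell starts with `prefix`."""
    blocks = {}
    key = None
    for _, row in frame.iterrows():
        cell = row.loc[col]
        if isinstance(cell, str) and cell.startswith(prefix):
            # Separator row: open a new block under this name
            key = cell
            blocks[key] = []
        elif key is not None:
            # Ordinary row: attach it to the block that is currently open
            blocks[key].append(row)
    # Build each DataFrame once; an empty block becomes an empty frame
    return {k: pd.DataFrame(v, columns=frame.columns) for k, v in blocks.items()}

# Shortened version of the example data
sample = pd.DataFrame(
    [
        ["Home -AA", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
        ["Home time1", 1, 2, 3],
        ["Home -BB", np.nan, np.nan, np.nan],
        ["place/time", "value1", "value2", "value3"],
    ],
    columns=[1, 2, 3, 4],
)

parts = split_on_separator(sample)
print(list(parts))
print(parts["Home -AA"])
```

Collecting rows in plain lists and building each DataFrame once avoids growing a DataFrame inside the loop, which gets slow on large inputs.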
You get the list index out of range error because y in the loop uses the (i+1)-th value of the list rawname. So you only want to loop up to len(rawname)-1, as follows:
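test = {}
# Stop one short of the end: rawname[i+1] does not exist for the last name.
# As written, the block for the last name is therefore not added to test.
for i in range(len(rawname) - 1):
    x = df[df[1] == rawname[i]].index.values      # row of the current separator
    y = df[df[1] == rawname[i + 1]].index.values  # row of the next separator
    df_1 = df.iloc[x[0]:y[0], :]
    test[rawname[i]] = df_1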
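One way to keep the final block as well is to add the length of the DataFrame as a sentinel end position. The sketch below makes several assumptions: `rawname` is not defined in the answer, so it is taken here to be the list of separator labels in order of appearance; the labels are assumed to be unique; and the small `df` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["Home -AA", np.nan, np.nan, np.nan],
        ["Home time1", 1, 2, 3],
        ["Home -BB", np.nan, np.nan, np.nan],
        ["Home time1", 11, 12, 13],
        ["sum", 101, np.nan, np.nan],
    ],
    columns=[1, 2, 3, 4],
)

# Assumption: rawname is the list of separator labels, in order of appearance.
rawname = [v for v in df[1] if isinstance(v, str) and v.startswith("Home -")]

# Positions of the separator rows, plus the end of the frame as a sentinel
mask = df[1].isin(rawname).to_numpy()
starts = list(np.flatnonzero(mask)) + [len(df)]

test = {}
for i, name in enumerate(rawname):
    # Slice from this separator up to (not including) the next one;
    # like the loop above, the slice includes the separator row itself.
    test[name] = df.iloc[starts[i]:starts[i + 1], :]

print(test["Home -BB"])
```

Because the sentinel supplies an end position for the last name, the loop can run over every entry of `rawname` without indexing past the end of the list.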
You can simply use groupby and cumsum:
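result = {}
# The cumulative sum of the "is separator" flag gives every block its own group id
for _, g in df.groupby(df[1].str.startswith("Home -").cumsum()):
    # The first row of each group is the separator (the key); the rest is the data
    name, d = g[1].iat[0], g.iloc[1:]
    # Drop the summary rows
    result[name] = d[~d[1].isin(["sum", "agg", "max", "min", "med"])]
print(result)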
{'Home -AA': 1 2 3 4
1 place/time value1 value2 value3
2 Home time1 1 2 3
3 Home time2 4 5 6
4 Home time3 7 8 9,
'Home -BB': 1 2 3 4
11 place/time value1 value2 value3
12 Home time1 11 12 13
13 Home time2 14 15 16
14 Home time3 17 18 19,
'Home -CC': 1 2 3 4
21 place/time value1 value2 value3,
'Home -DD': 1 2 3 4
23 place/time value1 value2 value3,
'Home -EE': 1 2 3 4
25 place/time value1 value2 value3,
'Home -FF': 1 2 3 4
27 place/time value1 value2 value3
28 Home time1 211 212 213
29 Home time1 212 213 214}
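To see why this grouping works, it can help to print the intermediate grouping key. The sketch below uses a shortened, made-up version of the example data; `is_sep` and `group_id` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["Home -AA", np.nan, np.nan, np.nan],
        ["Home time1", 1, 2, 3],
        ["sum", 11, np.nan, np.nan],
        ["Home -BB", np.nan, np.nan, np.nan],
        ["Home time1", 11, 12, 13],
    ],
    columns=[1, 2, 3, 4],
)

is_sep = df[1].str.startswith("Home -")  # True only on separator rows
group_id = is_sep.cumsum()               # increases by 1 at each separator: 1, 1, 1, 2, 2

print(pd.DataFrame({"label": df[1], "is_sep": is_sep, "group_id": group_id}))
```

Every row after a separator inherits that separator's running count, so grouping by `group_id` yields one group per "Home -" block with the separator as its first row.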