Pandas: how to increment an index per column/group

I need to format a dataframe from:
| country | county | city | street |
|-----------|----------|--------|-----------|
| country 1 | county 1 | city 1 | street 1 |
| country 1 | county 1 | city 1 | street 2 |
| country 1 | county 1 | city 2 | street 3 |
| country 2 | county 2 | city 3 | street 4 |
| country 2 | county 2 | city 3 | street 5 |
| country 3 | county 3 | city 4 | street 6 |
| country 3 | county 4 | city 5 | street 7 |
| country 3 | county 4 | city 6 | street 8 |
| country 3 | county 4 | city 6 | street 9 |
| country 3 | county 4 | city 6 | street 10 |
to the nested count format shown in the result below. The number of columns may vary.

I am managing the counts in multiple passes and tried to do the formatting in plain Python, without success. Is there a way to do this using only pandas? The solution I came up with works for an arbitrary number of columns, but it is not ideal for very large dataframes:
import pandas as pd

def add_column(col_name, col_value, count, tempdf):
    # append a single-row frame holding one column value and its count
    # (pd.concat replaces the removed DataFrame.append)
    new_row = pd.DataFrame([{col_name: col_value, 'count': count}])
    return pd.concat([tempdf, new_row], ignore_index=True)

newdf = pd.DataFrame()
col_arr = df.columns.to_list()
current = dict.fromkeys(col_arr)  # last seen value per column
for i, row in df.iterrows():
    for col in row.to_dict().keys():
        if current[col] != row[col]:
            for c in col_arr[col_arr.index(col):]:
                current[c] = row[c]
                newdf = add_column(c, row[c], df[lambda x: x[c] == row[c]].shape[0], newdf)
The result would be newdf:
country county city street count
0 country 1 NaN NaN NaN 3.0
1 NaN county 1 NaN NaN 3.0
2 NaN NaN city 1 NaN 2.0
3 NaN NaN NaN street 1 1.0
4 NaN NaN NaN street 2 1.0
5 NaN NaN city 2 NaN 1.0
6 NaN NaN NaN street 3 1.0
7 country 2 NaN NaN NaN 2.0
8 NaN county 2 NaN NaN 2.0
9 NaN NaN city 3 NaN 2.0
10 NaN NaN NaN street 4 1.0
11 NaN NaN NaN street 5 1.0
12 country 3 NaN NaN NaN 5.0
13 NaN county 3 NaN NaN 1.0
14 NaN NaN city 4 NaN 1.0
15 NaN NaN NaN street 6 1.0
16 NaN county 4 NaN NaN 4.0
17 NaN NaN city 5 NaN 1.0
18 NaN NaN NaN street 7 1.0
19 NaN NaN city 6 NaN 3.0
20 NaN NaN NaN street 8 1.0
21 NaN NaN NaN street 9 1.0
22 NaN NaN NaN street 10 1.0
You can first iterate over the rows of the original dataframe and create a list of offsets. Then, using this offset list, you can create a new dataframe with the desired format. Here is your dataframe:
import pandas as pd

countries = ['Country %i' % ind for ind in [1,1,1,2,2,3,3,3,3,3]]
counties = ['County %i' % ind for ind in [1,1,1,2,2,3,4,4,4,4]]
cites = ['City %i' % ind for ind in [1,1,2,3,3,4,5,6,6,6]]
streets = ['Street %i' % ind for ind in [1,2,3,4,5,6,7,8,9,10]]
test_df = pd.DataFrame({'Country':countries,'County':counties,'City':cites,'Street':streets})
To create the offset list, you need:
import numpy as np

col_names = test_df.columns.tolist()
col_inds = [[] for _ in range(len(col_names))]
col_offsets = np.arange(len(col_names))
row_buff = None
# iterate over rows of the dataframe;
# for each row, col_offsets records the indices of its entries in the output dataframe
for ind, row in test_df.iterrows():
    # for the first row, col_offsets is clearly [0, 1, 2, 3]
    if ind == 0:
        row_buff = row.values
        for ind_offset, offset in enumerate(col_offsets):
            col_inds[ind_offset].append(offset)
    else:
        # if it is not the first row,
        # find from which column the current row differs from the last row
        diff_ind = np.argmax(~(row.values == row_buff))
        last_offset = col_offsets.copy()
        for disp_count, disp_ind in enumerate(range(diff_ind, len(col_names))):
            col_offsets[disp_ind] = last_offset[-1] + disp_count + 1
            col_inds[disp_ind].append(col_offsets[disp_ind])
        row_buff = row.values
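The first-difference lookup in the loop above can be checked in isolation. This is a minimal sketch with hypothetical row values, not part of the answer's code:

```python
import numpy as np

prev = np.array(['Country 1', 'County 1', 'City 1', 'Street 2'])
curr = np.array(['Country 1', 'County 1', 'City 2', 'Street 3'])

# np.argmax over the negated equality mask returns the position of the
# first True, i.e. the first column where the two rows differ
diff_ind = np.argmax(~(curr == prev))
print(diff_ind)  # -> 2 (the 'City' column)
```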
In this example, the resulting offset lists for the columns are:
[[0, 7, 12],
[1, 8, 13, 16],
[2, 5, 9, 14, 17, 19],
[3, 4, 6, 10, 11, 15, 18, 20, 21, 22]]
Taking the first sublist as an example: the rows for Country 1, Country 2 and Country 3 should sit at indices 0, 7 and 12 in the first column of the output dataframe. Using these lists, you can build the output:
max_ind = max([inner for item in col_inds for inner in item])
# initialize an empty string array
placeholder = np.empty((max_ind + 1, len(col_names)), dtype='S10')
placeholder[:, :] = ''
for ind, item in enumerate(col_inds):
    if ind < len(col_inds) - 1:
        df_col = test_df.groupby(col_names[:ind+1], as_index=False).agg({col_names[ind+1]: 'count'}).iloc[:, ind]
    else:
        df_col = test_df.iloc[:, ind]
    for val_ind, val_pos in enumerate(item):
        placeholder[val_pos, ind] = df_col[val_ind]
# create new dataframe
new_df = pd.DataFrame({name: val for name, val in zip(col_names, placeholder.T)})
for col in col_names:
    new_df[col] = new_df[col].str.decode('utf-8')
You can iterate over the columns themselves and rely on DataFrame.value_counts() to get the counts at the different nesting levels. While doing so you need to make some adjustments to the index so that everything aligns correctly later, but in the end you simply glue the chunks together with pd.concat:
chunk_counts = []
for col in test_df.columns:
    counts = test_df.loc[:, :col].value_counts()
    n_empty_levels = test_df.columns.size - test_df.columns.get_loc(col) - 1
    empty_levels = [[""]] * n_empty_levels
    new_levels = [*counts.index.levels, *empty_levels]
    new_index = pd.MultiIndex.from_product(new_levels, names=test_df.columns)
    chunk_counts.append(counts.reindex(new_index))

final_series = (pd.concat(chunk_counts)
                .sort_index()
                .dropna()
                .astype(int)
                .rename("count"))
If you print(final_series), the report displays nicely, but only because a MultiIndex blanks out repeated entries at each nesting level in its display. That this is purely cosmetic becomes apparent once we use reset_index. To put our Series back into a frame while keeping the format the OP requested, we need a couple more adjustments:
index_cols = final_series.index.names
final_df = final_series.reset_index()
final_df[index_cols] = final_df[index_cols].where(~final_df[index_cols].apply(pd.Series.duplicated))
final_df = final_df.fillna("")
print(final_df)
Country County City Street count
0 Country 1 3
1 County 1 3
2 City 1 2
3 Street 1 1
4 Street 2 1
5 City 2 1
6 Street 3 1
7 Country 2 2
8 County 2 2
9 City 3 2
10 Street 4 1
11 Street 5 1
12 Country 3 5
13 County 3 1
14 City 4 1
15 Street 6 1
16 County 4 4
17 City 5 1
18 Street 7 1
19 City 6 3
20 Street 10 1
21 Street 8 1
22 Street 9 1
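The blank-out step relies on pd.Series.duplicated, which flags every repeat of a value seen earlier in the column (not only consecutive repeats, which is fine here because the frame is sorted). A minimal sketch with toy values:

```python
import pandas as pd

s = pd.Series(["Country 1", "Country 1", "Country 2", "Country 1"])
# duplicated() marks each value that already appeared earlier in the Series
flags = s.duplicated()
print(flags.tolist())  # -> [False, True, False, True]
# where(~flags) keeps first occurrences and replaces repeats with NaN
print(s.where(~flags).tolist())  # -> ['Country 1', nan, 'Country 2', nan]
```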
Here is one version and a variant of it. The first version is slightly simpler, although its printed result is not exactly what you specified; it should be fast, since it only needs four groupby statements, one concat and one sort_index. The second version is a modification of the first, should still be fast, and its output looks like the one in your question:
cols= ['country', 'county', 'city', 'street']
dfs_result= list()
group_key= list()
cols_left= cols
while cols_left:
count_key, *cols_left= cols_left
group_key.append(count_key)
df_counts= df.groupby(group_key).agg(count=(count_key, 'count'))
df_counts.reset_index(drop=False, inplace=True)
for col in cols:
if col not in df_counts.columns:
# create a fake column with NAs, so we have the same structure for
# all count dfs
df_counts[col]= np.NaN
df_counts.set_index(cols, inplace=True)
dfs_result.append(df_counts)
df_result= pd.concat(dfs_result, axis='index')
df_result.sort_index(inplace=True)
df_result
The result looks like this:
country county city street count
0 country 1 county 1 city 1 street 1 1
1 country 1 county 1 city 1 street 2 1
2 country 1 county 1 city 1 NaN 2
3 country 1 county 1 city 2 street 3 1
4 country 1 county 1 city 2 NaN 1
5 country 1 county 1 NaN NaN 3
6 country 1 NaN NaN NaN 3
7 country 2 county 2 city 3 street 4 1
8 country 2 county 2 city 3 street 5 1
9 country 2 county 2 city 3 NaN 2
10 country 2 county 2 NaN NaN 2
11 country 2 NaN NaN NaN 2
12 country 3 county 3 city 4 street 6 1
13 country 3 county 3 city 4 NaN 1
14 country 3 county 3 NaN NaN 1
15 country 3 county 4 city 5 street 7 1
16 country 3 county 4 city 5 NaN 1
17 country 3 county 4 city 6 street 10 1
18 country 3 county 4 city 6 street 8 1
19 country 3 county 4 city 6 street 9 1
20 country 3 county 4 city 6 NaN 3
21 country 3 county 4 NaN NaN 4
22 country 3 NaN NaN NaN 5
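The agg(count=(count_key, 'count')) call in the loop uses pandas named aggregation. A minimal sketch with a hypothetical toy frame shows what it produces on its own:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["c1", "c1", "c2"],
                    "city": ["a", "b", "b"]})
# named aggregation: the output column 'count' holds the number of
# non-null 'city' values within each country group
out = toy.groupby("country").agg(count=("city", "count"))
print(out.loc["c1", "count"])  # -> 2
```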
With just a small modification, you can get the output you asked for:
dfs_result= list()
group_key= list()
cols_left= cols
index_cols= [f'i_{col}' for col in cols]
while cols_left:
count_key, *cols_left= cols_left
group_key.append(count_key)
df_counts= df.groupby(group_key).agg(count=(count_key, 'count'))
df_counts.reset_index(drop=False, inplace=True)
# create temporary index columns and a level column
df_counts['level']= len(group_key)
for col in cols:
if col not in df_counts.columns:
# create a fake column with NAs, so we have the same structure for
# all count dfs
df_counts[f'i_{col}']= ''
else:
df_counts[f'i_{col}']= df_counts[col]
if col != count_key:
df_counts[col]= ''
# set the new index (you could also just sort at the end
# for these columns if you like that better
df_counts.set_index(index_cols + ['level'], inplace=True)
dfs_result.append(df_counts)
# concat the partial counts, sort and then drop the
# temporary index
df_result= pd.concat(dfs_result, axis='index')
df_result.sort_index(inplace=True, axis='index')
df_result.reset_index(drop=True, inplace=True)
df_result[cols + ['count']]
The final result looks like this:
country county city street count
0 country 1 3
1 county 1 3
2 city 1 2
3 street 1 1
4 street 2 1
5 city 2 1
6 street 3 1
7 country 2 2
8 county 2 2
9 city 3 2
10 street 4 1
11 street 5 1
12 country 3 5
13 county 3 1
14 city 4 1
15 street 6 1
16 county 4 4
17 city 5 1
18 street 7 1
19 city 6 3
20 street 10 1
21 street 8 1
22 street 9 1
Maybe this will help. — Thanks @woblob, I am using a similar approach for the counting; the main problem is producing the additional rows. The error I get is
'DataFrame' object has no attribute 'value_counts'
on the line counts = test_df.loc[:, :col].value_counts()
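A likely cause, assuming the commenter's environment runs a pandas version older than 1.1 (where DataFrame.value_counts was introduced): on such versions an equivalent count can be obtained with a groupby over the same columns. A hedged sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 2]})
# DataFrame.value_counts() only exists in pandas >= 1.1; grouping over
# the same columns and taking the group sizes yields the same counts
counts = toy.groupby(list(toy.columns)).size()
print(counts.loc[("x", 1)])  # -> 1
```

Unlike value_counts, the result is sorted by the group keys rather than by count, which suits the subsequent sort_index anyway.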