Pandas: how to increment the index for each column/group

I need to reformat a DataFrame, starting from the following:

| country   | county   | city   | street    |
|-----------|----------|--------|-----------|
| country 1 | county 1 | city 1 | street 1  |
| country 1 | county 1 | city 1 | street 2  |
| country 1 | county 1 | city 2 | street 3  |
| country 2 | county 2 | city 3 | street 4  |
| country 2 | county 2 | city 3 | street 5  |
| country 3 | county 3 | city 4 | street 6  |
| country 3 | county 4 | city 5 | street 7  |
| country 3 | county 4 | city 6 | street 8  |
| country 3 | county 4 | city 6 | street 9  |
| country 3 | county 4 | city 6 | street 10 |

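The number of columns may vary.

I managed to get the counts, and I tried to do the formatting in plain Python, but without success. Is there a way to do this using only pandas?

The solution I came up with works for any number of columns, but it is not a good fit for very large DataFrames: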

import pandas as pd

def add_row(col_name, col_value, count, rows):
    # collect plain dicts; DataFrame.append was removed in pandas 2.0
    rows.append({col_name: col_value, 'count': count})

rows = []
col_arr = df.columns.to_list()
current = dict.fromkeys(col_arr)  # last value seen in each column
for i, row in df.iterrows():
    for col in col_arr:
        if current[col] != row[col]:
            # this column changed: emit it and every column to its right
            for c in col_arr[col_arr.index(col):]:
                current[c] = row[c]
                add_row(c, row[c], (df[c] == row[c]).sum(), rows)
            break
newdf = pd.DataFrame(rows, columns=col_arr + ['count'])
The result is newdf:

      country    county    city     street  count
0   country 1       NaN     NaN        NaN      3
1         NaN  county 1     NaN        NaN      3
2         NaN       NaN  city 1        NaN      2
3         NaN       NaN     NaN   street 1      1
4         NaN       NaN     NaN   street 2      1
5         NaN       NaN  city 2        NaN      1
6         NaN       NaN     NaN   street 3      1
7   country 2       NaN     NaN        NaN      2
8         NaN  county 2     NaN        NaN      2
9         NaN       NaN  city 3        NaN      2
10        NaN       NaN     NaN   street 4      1
11        NaN       NaN     NaN   street 5      1
12  country 3       NaN     NaN        NaN      5
13        NaN  county 3     NaN        NaN      1
14        NaN       NaN  city 4        NaN      1
15        NaN       NaN     NaN   street 6      1
16        NaN  county 4     NaN        NaN      4
17        NaN       NaN  city 5        NaN      1
18        NaN       NaN     NaN   street 7      1
19        NaN       NaN  city 6        NaN      3
20        NaN       NaN     NaN   street 8      1
21        NaN       NaN     NaN   street 9      1
22        NaN       NaN     NaN  street 10      1

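You can first iterate over the rows of the original DataFrame and build a list of offsets. Then, using that list of offsets, you can create a new DataFrame in the required format.

Here is your DataFrame:

import numpy as np
import pandas as pd

countries = ['Country %i' % ind for ind in [1,1,1,2,2,3,3,3,3,3]]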
counties = ['County %i' % ind for ind in [1,1,1,2,2,3,4,4,4,4]]
cites = ['City %i' % ind for ind in [1,1,2,3,3,4,5,6,6,6]]
streets = ['Street %i' % ind for ind in [1,2,3,4,5,6,7,8,9,10]]

test_df = pd.DataFrame({'Country':countries,'County':counties,'City':cites,'Street':streets})
To build the offset list:

col_names = test_df.columns.tolist()

col_inds = [[] for _ in range(len(col_names))]
col_offsets = np.arange(len(col_names))
row_buff = None

# iterate over rows of the dataframe
# the array col_offsets of a row records the indices of the row entries in the output dataframe 
for ind,row in test_df.iterrows():

    # if it is the first row, the col_offsets is clearly [0,1,2,3]
    if ind == 0:
        row_buff = row.values
        for ind_offset,offset in enumerate(col_offsets):
            col_inds[ind_offset].append(offset)
    else:
        # if it is not the first row, 
        # find from which column the current row is different from the last row
        diff_ind = np.argmax(~(row.values==row_buff))
        last_offset = col_offsets.copy()
        for disp_count,disp_ind in enumerate(range(diff_ind,len(col_names))):
            col_offsets[disp_ind] = last_offset[-1] + disp_count + 1
            col_inds[disp_ind].append(col_offsets[disp_ind])
            
        row_buff = row.values
The offset list is col_inds, which in this example is:

[[0, 7, 12],
 [1, 8, 13, 16],
 [2, 5, 9, 14, 17, 19],
 [3, 4, 6, 10, 11, 15, 18, 20, 21, 22]]
Taking the first sublist as an example, Country 1, Country 2 and Country 3 should sit at row indices 0, 7 and 12 in the first column of the output DataFrame.
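
The loop above finds the first column in which a row differs from the previous one by calling np.argmax on a boolean array. A minimal sketch of that step in isolation (the names prev, curr and first_changed are made up for illustration):

```python
import numpy as np

# two consecutive rows of the example frame, as object arrays
prev = np.array(["Country 1", "County 1", "City 1", "Street 2"], dtype=object)
curr = np.array(["Country 1", "County 1", "City 2", "Street 3"], dtype=object)

changed = curr != prev                   # elementwise comparison -> boolean array
first_changed = int(np.argmax(changed))  # index of the first True
print(first_changed)

# caveat: argmax also returns 0 when no element is True
no_change = int(np.argmax(np.array([False, False, False, False])))
print(no_change)
```

Because np.argmax returns 0 when nothing changed, two identical consecutive rows would be treated as if every column had changed. If the data can contain fully duplicated rows, it is probably worth checking changed.any() first.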

Using col_inds, you can build the new DataFrame with:

max_ind = max([inner for item in col_inds for inner in item])
# initialize an empty string array 
placeholder = np.empty((max_ind+1,len(col_names)),dtype='S10')
placeholder[:,:] = ''

for ind,item in enumerate(col_inds):
    
    if ind < len(col_inds) - 1:
        df_col = test_df.groupby(col_names[:ind+1],as_index=False).agg({col_names[ind+1]:'count'}).iloc[:,ind]
    else:
        df_col = test_df.iloc[:,ind]
        
    for val_ind,val_pos in enumerate(item):
        placeholder[val_pos,ind] = df_col[val_ind]

# create new dataframe
new_df = pd.DataFrame({name:val for name,val in zip(col_names,placeholder.T)})
for col in col_names:
    new_df[col] = new_df[col].str.decode('utf-8') 
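
One thing to watch in the code above: dtype='S10' stores fixed-width byte strings of 10 bytes, so longer labels are cut off and the columns have to be decoded afterwards. A minimal sketch of the same scatter step with an object array instead; place_values is a hypothetical helper and the sample values are invented for the demonstration:

```python
import numpy as np
import pandas as pd

def place_values(values_per_col, positions_per_col, col_names):
    """Scatter each column's values into the output rows given by its offsets.

    Hypothetical helper for illustration; not part of the answer above.
    """
    n_rows = max(pos for positions in positions_per_col for pos in positions) + 1
    # object dtype keeps Python str values intact, whatever their length
    grid = np.full((n_rows, len(col_names)), "", dtype=object)
    for col_idx, (values, positions) in enumerate(zip(values_per_col, positions_per_col)):
        grid[list(positions), col_idx] = list(values)
    return pd.DataFrame(grid, columns=col_names)

out = place_values(
    [["Country 1"], ["County 1"], ["City 1", "City 2"],
     ["A street name longer than ten bytes", "Street 2", "Street 3"]],
    [[0], [1], [2, 5], [3, 4, 6]],
    ["Country", "County", "City", "Street"],
)
print(out)
```

With an object array no decode step is needed, at the cost of a less compact array.
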
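You can iterate over the columns themselves and rely on DataFrame.value_counts() to get the counts at the different nesting levels. While doing so you need to adjust the index so that everything lines up correctly later, but in the end you can simply use pd.concat to glue the chunks together: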

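chunk_counts = []
for col in test_df.columns:
    counts = test_df.loc[:, :col].value_counts()
    n_empty_levels = test_df.columns.size - test_df.columns.get_loc(col) - 1
    empty_levels = [[""]] * n_empty_levels
    new_levels = [*counts.index.levels, *empty_levels]
    new_index = pd.MultiIndex.from_product(new_levels, names=test_df.columns)
    chunk_counts.append(counts.reindex(new_index))

final_series = (pd.concat(chunk_counts)
                .sort_index()
                .dropna()
                .astype(int)
                .rename("count"))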
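If you print(final_series), the report looks fine, but the MultiIndex does not actually have empty entries under each nesting level (that is just how a MultiIndex is displayed). This becomes apparent when we call reset_index. To turn the Series back into a frame while keeping the format the OP requested, we need a few more adjustments: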

index_cols = final_series.index.names
final_df = final_series.reset_index()
final_df[index_cols] = final_df[index_cols].where(~final_df[index_cols].apply(pd.Series.duplicated))
final_df = final_df.fillna("")

print(final_df)

      Country    County    City     Street  count
0   Country 1                                   3
1              County 1                         3
2                        City 1                 2
3                                 Street 1      1
4                                 Street 2      1
5                        City 2                 1
6                                 Street 3      1
7   Country 2                                   2
8              County 2                         2
9                        City 3                 2
10                                Street 4      1
11                                Street 5      1
12  Country 3                                   5
13             County 3                         1
14                       City 4                 1
15                                Street 6      1
16             County 4                         4
17                       City 5                 1
18                                Street 7      1
19                       City 6                 3
20                               Street 10      1
21                                Street 8      1
22                                Street 9      1
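
The masking step above relies on Series.duplicated, applied column by column, to blank every repeat of a label after its first appearance. A minimal sketch on a small made-up frame (the names frame, label_cols and is_repeat are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({
    "Country": ["Country 1", "Country 1", "Country 1", "Country 2"],
    "County": ["", "County 1", "County 1", ""],
    "count": [3, 3, 2, 2],
})
label_cols = ["Country", "County"]

# duplicated() flags every repeat of a value after its first appearance in that column
is_repeat = frame[label_cols].apply(pd.Series.duplicated)
# where() keeps the first appearances and turns repeats into NaN, then NaN becomes ""
frame[label_cols] = frame[label_cols].where(~is_repeat).fillna("")
print(frame)
```

Note that the check runs per column over the whole frame, so a label that legitimately recurs under a different parent (for example the same city name in two counties) would be blanked as well.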

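Here is another version, plus a variant of it. The first version is slightly simpler, but its printed result is not exactly what you specified. It should be fast, though, because it only needs four groupby statements, one concat and one sort_index. The second version is a modification of the first; it should still be fast, and its output looks like the one in your question: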

cols= ['country', 'county', 'city', 'street']
dfs_result= list()
group_key= list()
cols_left= cols
while cols_left:
    count_key, *cols_left= cols_left
    group_key.append(count_key)
    df_counts= df.groupby(group_key).agg(count=(count_key, 'count'))
    df_counts.reset_index(drop=False, inplace=True)
    for col in cols:
        if col not in df_counts.columns:
            # create a fake column with NAs, so we have the same structure for
            # all count dfs
            df_counts[col]= np.nan  # the np.NaN alias was removed in NumPy 2.0
    df_counts.set_index(cols, inplace=True)
    dfs_result.append(df_counts)

df_result= pd.concat(dfs_result, axis='index')
df_result.sort_index(inplace=True)
df_result
The result looks like this:

      country    county    city     street  count
0   country 1  county 1  city 1   street 1      1
1   country 1  county 1  city 1   street 2      1
2   country 1  county 1  city 1        NaN      2
3   country 1  county 1  city 2   street 3      1
4   country 1  county 1  city 2        NaN      1
5   country 1  county 1     NaN        NaN      3
6   country 1       NaN     NaN        NaN      3
7   country 2  county 2  city 3   street 4      1
8   country 2  county 2  city 3   street 5      1
9   country 2  county 2  city 3        NaN      2
10  country 2  county 2     NaN        NaN      2
11  country 2       NaN     NaN        NaN      2
12  country 3  county 3  city 4   street 6      1
13  country 3  county 3  city 4        NaN      1
14  country 3  county 3     NaN        NaN      1
15  country 3  county 4  city 5   street 7      1
16  country 3  county 4  city 5        NaN      1
17  country 3  county 4  city 6  street 10      1
18  country 3  county 4  city 6   street 8      1
19  country 3  county 4  city 6   street 9      1
20  country 3  county 4  city 6        NaN      3
21  country 3  county 4     NaN        NaN      4
22  country 3       NaN     NaN        NaN      5
With a small modification you can get the output you want:

dfs_result= list()
group_key= list()
cols_left= cols
index_cols= [f'i_{col}' for col in cols]
while cols_left:
    count_key, *cols_left= cols_left
    group_key.append(count_key)
    df_counts= df.groupby(group_key).agg(count=(count_key, 'count'))
    df_counts.reset_index(drop=False, inplace=True)
    # create temporary index columns and a level column
    df_counts['level']= len(group_key)
    for col in cols:
        if col not in df_counts.columns:
            # create a fake column with NAs, so we have the same structure for
            # all count dfs
            df_counts[f'i_{col}']= ''
        else:
            df_counts[f'i_{col}']= df_counts[col]
        if col != count_key:
            df_counts[col]= ''
    # set the new index (you could also just sort by these columns
    # at the end, if you like that better)
    df_counts.set_index(index_cols + ['level'], inplace=True)
    dfs_result.append(df_counts)

# concat the partial counts, sort and then drop the 
# temporary index 
df_result= pd.concat(dfs_result, axis='index')
df_result.sort_index(inplace=True, axis='index')
df_result.reset_index(drop=True, inplace=True)
df_result[cols + ['count']]
The final result looks like this:

      country    county    city     street  count
0   country 1                                   3
1              county 1                         3
2                        city 1                 2
3                                 street 1      1
4                                 street 2      1
5                        city 2                 1
6                                 street 3      1
7   country 2                                   2
8              county 2                         2
9                        city 3                 2
10                                street 4      1
11                                street 5      1
12  country 3                                   5
13             county 3                         1
14                       city 4                 1
15                                street 6      1
16             county 4                         4
17                       city 5                 1
18                                street 7      1
19                       city 6                 3
20                               street 10      1
21                                street 8      1
22                                street 9      1
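
For reference, the same idea (one groupby per nesting level, then a single concat and sort) can be wrapped in a function that takes any list of columns. This is only a sketch under two assumptions: the key columns contain no missing values, and rows that belong to the same group are contiguous, as in the example data. nested_counts and the helper columns _first_row and _depth are hypothetical names:

```python
import pandas as pd

def nested_counts(df, cols):
    """Return one row per distinct prefix of `cols`, with its row count."""
    parts = []
    for depth in range(1, len(cols) + 1):
        keys = cols[:depth]
        # sort=False keeps the groups in order of first appearance
        part = df.groupby(keys, sort=False).size().reset_index(name="count")
        # position of each group's first row, in the same order as the groups
        part["_first_row"] = (~df.duplicated(keys)).to_numpy().nonzero()[0]
        part["_depth"] = depth
        shown = cols[depth - 1]
        for col in cols:
            if col != shown:
                part[col] = ""  # only the deepest key of this level is displayed
        parts.append(part)
    out = pd.concat(parts, ignore_index=True)
    # parents share their first row with their first child, so depth breaks the tie
    out = out.sort_values(["_first_row", "_depth"])
    return out[cols + ["count"]].reset_index(drop=True)

df = pd.DataFrame({
    "country": ["country %i" % i for i in [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]],
    "county": ["county %i" % i for i in [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]],
    "city": ["city %i" % i for i in [1, 1, 2, 3, 3, 4, 5, 6, 6, 6]],
    "street": ["street %i" % i for i in range(1, 11)],
})
result = nested_counts(df, ["country", "county", "city", "street"])
print(result)
```

Unlike the version above, this keeps the rows in order of first appearance, so street 10 comes after street 9 rather than before street 8. If the input is not already grouped, sorting it by the key columns first should restore the contiguity assumption.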

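Maybe this will help.
Thanks @woblob, I am using a similar approach for the counts. The main problem is incrementing the rows.
The error I get is
'DataFrame' object has no attribute 'value_counts'
on the line
counts = test_df.loc[:, :col].value_counts()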
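
DataFrame.value_counts was added in pandas 1.1.0, so older versions raise this AttributeError. A minimal sketch of an equivalent count using groupby, which older versions also provide; the frame toy and the names prefix and counts are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Country 1", "Country 1", "Country 2"],
    "County": ["County 1", "County 1", "County 2"],
    "City": ["City 1", "City 2", "City 3"],
})

col = "County"
prefix = toy.loc[:, :col]  # label slice: every column up to and including `col`
# size() counts the rows for each distinct combination of the prefix columns
counts = prefix.groupby(list(prefix.columns)).size()
print(counts)
```

The result is ordered by the group keys rather than by descending count, which does not matter here because the chunks are sorted by index afterwards anyway.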