Pandas: apply KMeans within each group of a DataFrame and save the cluster labels in a new column of the same DataFrame


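I have a DataFrame with some embeddings in column D. I want to first group the data by column A and then apply kmeans to each group. Each group may contain NA values, so inside the apply function I set the number of clusters to half the number of non-NA values in column D (n_clusters = int(not_na_mask.sum()/2)). The apply function returns df['cluster'].values.tolist(). I printed these values and they are correct for every group, but after running the whole script, df_test['clusters'] contains only NaN in all rows.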

Sample DataFrame:

import numpy as np
import pandas as pd

df_test = pd.DataFrame({'A' : ['aa', 'bb', 'aa', 'bb','aa', 'bb', 'aa', 'cc', 'aa', 'aa', 'bb', 'bb', 'bb','cc', 'bb', 'aa', 'cc', 'aa'],
                       'B' : [1, 2, np.nan, 4, 6, np.nan, 7, 8, np.nan, 1, 4, 3, 4, 7, 5, 7, 9, np.nan],
                       'D' : [[2, 0, 1, 5, 4, 0], np.nan, [4, 7, 0, 1, 0, 2], [1., 1, 1, 2, 0, 5], np.nan , [1, 6, 3, 2, 1, 9], [4, 2, 1, 0, 0, 0], [3, 5, 6, 8, 8, 0], np.nan,
                             np.nan, [2, 5, 1, 7, 4, 0] , [4, 2, 0, 4, 0, 0], [1., 0, 1, 8, 0, 9], [1, 0, 7, 2, 1, 0], np.nan , [1, 1, 5, 0, 8, 0], [4, 1, 6, 1, 1, 0], np.nan]})

df_test:
    A   B   D
0   aa  1.0 [2, 0, 1, 5, 4, 0]
1   bb  2.0 NaN
2   aa  NaN [4, 7, 0, 1, 0, 2]
3   bb  4.0 [1.0, 1, 1, 2, 0, 5]
4   aa  6.0 NaN
5   bb  NaN [1, 6, 3, 2, 1, 9]
6   aa  7.0 [4, 2, 1, 0, 0, 0]
7   cc  8.0 [3, 5, 6, 8, 8, 0]
8   aa  NaN NaN
9   aa  1.0 NaN
10  bb  4.0 [2, 5, 1, 7, 4, 0]
11  bb  3.0 [4, 2, 0, 4, 0, 0]
12  bb  4.0 [1.0, 0, 1, 8, 0, 9]
13  cc  7.0 [1, 0, 7, 2, 1, 0]
14  bb  5.0 NaN
15  aa  7.0 [1, 1, 5, 0, 8, 0]
16  cc  9.0 [4, 1, 6, 1, 1, 0]
17  aa  NaN NaN

How I compute kmeans:

def apply_kmeans_on_each_category(df):
    
    not_na_mask = df['D'].notna()
    
    embedding = df[not_na_mask]['D']
    n_clusters = int(not_na_mask.sum()/2)
    
    if n_clusters > 1:
        df['cluster'] = np.nan
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        df.loc[not_na_mask, 'cluster'] = kmeans.labels_
        return df['cluster'].values.tolist()
    else:
        return [np.nan] * len(df)

df_test['clusters'] = df_test.groupby('A').apply(apply_kmeans_on_each_category)

Result:

df_test['clusters']:
0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
15    NaN
16    NaN
17    NaN
Name: clusters, dtype: object

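I made a few minor changes. The essential change is to use transform instead of apply. Also, there is no need to pass the whole grouped DataFrame; you can pass column D directly, since it is the only column you use: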

from sklearn.cluster import KMeans

def apply_kmeans_on_each_category(s):
    # s is column D of a single group: a Series of embeddings, possibly with NaN
    not_na_mask = s.notna()
    embedding = s.loc[not_na_mask]
    n_clusters = int(not_na_mask.sum() / 2)

    # Start from all-NaN labels that keep the group's original row index
    op = pd.Series(np.nan, index=s.index)
    if n_clusters > 1:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        op.loc[not_na_mask] = kmeans.labels_.tolist()
    return op

df_test['clusters'] = df_test.groupby('A')['D'].transform(apply_kmeans_on_each_category)

Output (group cc stays NaN because it has only three embeddings, so n_clusters is 1):

    A   B   D   clusters
0   aa  1.0 [2, 0, 1, 5, 4, 0]  0.0
1   bb  2.0 NaN NaN
2   aa  NaN [4, 7, 0, 1, 0, 2]  1.0
3   bb  4.0 [1.0, 1, 1, 2, 0, 5]    0.0
4   aa  6.0 NaN NaN
5   bb  NaN [1, 6, 3, 2, 1, 9]  0.0
6   aa  7.0 [4, 2, 1, 0, 0, 0]  1.0
7   cc  8.0 [3, 5, 6, 8, 8, 0]  NaN
8   aa  NaN NaN NaN
9   aa  1.0 NaN NaN
10  bb  4.0 [2, 5, 1, 7, 4, 0]  1.0
11  bb  3.0 [4, 2, 0, 4, 0, 0]  1.0
12  bb  4.0 [1.0, 0, 1, 8, 0, 9]    0.0
13  cc  7.0 [1, 0, 7, 2, 1, 0]  NaN
14  bb  5.0 NaN NaN
15  aa  7.0 [1, 1, 5, 0, 8, 0]  0.0
16  cc  9.0 [4, 1, 6, 1, 1, 0]  NaN
17  aa  NaN NaN NaN
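
The likely reason the original attempt produced only NaN is index alignment: when the function passed to apply returns a list, the result is a Series with one list per group, indexed by the group keys (aa, bb, cc), and assigning it to a column aligns on labels that never match the row index 0..17. The sketch below illustrates this on a toy frame. It is a minimal illustration, not the code from the question: label_rows is a hypothetical stand-in for the kmeans step that simply numbers the non-missing rows of each group, and the toy values in D are made up for the demonstration.

```python
import numpy as np
import pandas as pd

# Toy frame: group key in "A", one value column "D" with a missing entry.
df = pd.DataFrame({"A": ["aa", "bb", "aa", "bb"],
                   "D": [1.0, np.nan, 3.0, 4.0]})

def label_rows(s):
    # Hypothetical stand-in for the clustering step: number the non-missing
    # rows of the group 0, 1, ... and leave missing rows as NaN.
    out = pd.Series(np.nan, index=s.index)
    mask = s.notna()
    out.loc[mask] = np.arange(mask.sum(), dtype=float)
    return out

# apply + list: one list per group, indexed by the group keys "aa" and "bb".
per_group = df.groupby("A")["D"].apply(lambda s: label_rows(s).tolist())

# Column assignment aligns on index labels; "aa"/"bb" never match 0..3,
# so every row ends up NaN.
df["via_apply"] = per_group

# transform: the result keeps the original row index, so it aligns row by row.
df["via_transform"] = df.groupby("A")["D"].transform(label_rows)

print(per_group)
print(df)
```

Printing per_group makes the mismatch visible: its index holds the group keys rather than row numbers, which is why transform (or any approach that returns values on the original index) is needed before assigning back to the frame.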