Pandas: apply kmeans to each group of a dataframe and save the clusters in a new column of the same dataframe
I have a dataframe that contains some embeddings in column D. I want to first group the data by column A and then apply kmeans to each group. Each group may contain NA values, so inside the apply function I set the number of clusters to the number of non-NA values in column D divided by 2 (n_clusters = int(not_na_mask.sum()/2)). The apply function returns df['cluster'].values.tolist(). I printed these values and they are correct for each group, but after running the whole script, df_test['clusters'] contains only NaN in every row.
Sample dataframe:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'A': ['aa', 'bb', 'aa', 'bb', 'aa', 'bb', 'aa', 'cc', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc', 'bb', 'aa', 'cc', 'aa'],
                        'B': [1, 2, np.nan, 4, 6, np.nan, 7, 8, np.nan, 1, 4, 3, 4, 7, 5, 7, 9, np.nan],
                        'D': [[2, 0, 1, 5, 4, 0], np.nan, [4, 7, 0, 1, 0, 2], [1., 1, 1, 2, 0, 5], np.nan, [1, 6, 3, 2, 1, 9], [4, 2, 1, 0, 0, 0], [3, 5, 6, 8, 8, 0], np.nan,
                              np.nan, [2, 5, 1, 7, 4, 0], [4, 2, 0, 4, 0, 0], [1., 0, 1, 8, 0, 9], [1, 0, 7, 2, 1, 0], np.nan, [1, 1, 5, 0, 8, 0], [4, 1, 6, 1, 1, 0], np.nan]})
df_test:
     A    B  D
 0  aa  1.0  [2, 0, 1, 5, 4, 0]
 1  bb  2.0  NaN
 2  aa  NaN  [4, 7, 0, 1, 0, 2]
 3  bb  4.0  [1.0, 1, 1, 2, 0, 5]
 4  aa  6.0  NaN
 5  bb  NaN  [1, 6, 3, 2, 1, 9]
 6  aa  7.0  [4, 2, 1, 0, 0, 0]
 7  cc  8.0  [3, 5, 6, 8, 8, 0]
 8  aa  NaN  NaN
 9  aa  1.0  NaN
10  bb  4.0  [2, 5, 1, 7, 4, 0]
11  bb  3.0  [4, 2, 0, 4, 0, 0]
12  bb  4.0  [1.0, 0, 1, 8, 0, 9]
13  cc  7.0  [1, 0, 7, 2, 1, 0]
14  bb  5.0  NaN
15  aa  7.0  [1, 1, 5, 0, 8, 0]
16  cc  9.0  [4, 1, 6, 1, 1, 0]
17  aa  NaN  NaN
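The cluster-count rule described in the question (non-NA embeddings in the group, divided by 2 and truncated) can be checked per group with a small sketch. The lists below copy only the group labels and the missing-value pattern of the sample data; the `has_embedding` column is a hypothetical stand-in for `df_test['D'].notna()`.

```python
import pandas as pd

# Group labels and NaN pattern copied from the sample data; the embeddings
# themselves are omitted because only their presence matters for the count.
A = ['aa', 'bb', 'aa', 'bb', 'aa', 'bb', 'aa', 'cc', 'aa',
     'aa', 'bb', 'bb', 'bb', 'cc', 'bb', 'aa', 'cc', 'aa']
has_embedding = [True, False, True, True, False, True, True, True, False,
                 False, True, True, True, True, False, True, True, False]

df = pd.DataFrame({'A': A, 'has_embedding': has_embedding})

# Number of non-NA embeddings per group, then integer-divide by 2.
non_na = df.groupby('A')['has_embedding'].sum()
n_clusters = non_na // 2
print(n_clusters.to_dict())
```

With this rule a group needs at least four embeddings to end up with two or more clusters, so a group with only three embeddings gets a cluster count of 1.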
My approach to computing kmeans:
from sklearn.cluster import KMeans

def apply_kmeans_on_each_category(df):
    not_na_mask = df['D'].notna()
    embedding = df[not_na_mask]['D']
    n_clusters = int(not_na_mask.sum()/2)
    if n_clusters > 1:
        df['cluster'] = np.nan
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        df.loc[not_na_mask, 'cluster'] = kmeans.labels_
        return df['cluster'].values.tolist()
    else:
        return [np.nan] * len(df)

df_test['clusters'] = df_test.groupby('A').apply(apply_kmeans_on_each_category)
Result:
df_test['clusters']:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
Name: clusters, dtype: object
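A likely reason for the all-NaN column is index alignment: when the applied function returns a plain list, groupby(...).apply produces one entry per group, indexed by the group key, and assigning that to a column aligns on the index, so none of the original row labels match. The sketch below shows this with toy data; the frame and the helper `per_group_list` are illustrative assumptions, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({'A': ['aa', 'bb', 'aa', 'bb'],
                   'D': [1.0, 2.0, 3.0, 4.0]})

# Hypothetical stand-in for the clustering function: one list per group.
def per_group_list(group):
    return group['D'].tolist()

result = df.groupby('A')[['D']].apply(per_group_list)
print(result.index.tolist())  # group keys, not the original row labels

# Assignment aligns on index labels; 0..3 never match 'aa'/'bb'.
df['clusters'] = result
print(df['clusters'].isna().all())
```

Returning a Series that carries each group's original index avoids this mismatch.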
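I made a few minor changes. The essence of the change is to use transform instead of apply. Also, there is no need to pass the whole grouped df; you can pass column D directly, since that is the only column you use: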
def apply_kmeans_on_each_category(df):
    # df here is the Series of column D values for one group, not a DataFrame
    not_na_mask = df.notna()
    embedding = df.loc[not_na_mask]
    n_clusters = int(not_na_mask.sum()/2)
    # Start with NaN for every row, keeping the group's original index
    op = pd.Series([np.nan] * len(df), index=df.index)
    if n_clusters > 1:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        op.loc[not_na_mask] = kmeans.labels_.tolist()
    return op

df_test['clusters'] = df_test.groupby('A')['D'].transform(apply_kmeans_on_each_category)
Output:
     A    B  D                     clusters
 0  aa  1.0  [2, 0, 1, 5, 4, 0]    0.0
 1  bb  2.0  NaN                   NaN
 2  aa  NaN  [4, 7, 0, 1, 0, 2]    1.0
 3  bb  4.0  [1.0, 1, 1, 2, 0, 5]  0.0
 4  aa  6.0  NaN                   NaN
 5  bb  NaN  [1, 6, 3, 2, 1, 9]    0.0
 6  aa  7.0  [4, 2, 1, 0, 0, 0]    1.0
 7  cc  8.0  [3, 5, 6, 8, 8, 0]    NaN
 8  aa  NaN  NaN                   NaN
 9  aa  1.0  NaN                   NaN
10  bb  4.0  [2, 5, 1, 7, 4, 0]    1.0
11  bb  3.0  [4, 2, 0, 4, 0, 0]    1.0
12  bb  4.0  [1.0, 0, 1, 8, 0, 9]  0.0
13  cc  7.0  [1, 0, 7, 2, 1, 0]    NaN
14  bb  5.0  NaN                   NaN
15  aa  7.0  [1, 1, 5, 0, 8, 0]    0.0
16  cc  9.0  [4, 1, 6, 1, 1, 0]    NaN
17  aa  NaN  NaN                   NaN
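For anyone who wants to try the transform approach in isolation, here is a minimal self-contained sketch on toy data. The names `cluster_group` and `toy` and the explicit `n_init=10` argument are assumptions added for illustration; the clustering rule (non-NA count divided by 2, clustering only when that exceeds 1) follows the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def cluster_group(embeddings: pd.Series) -> pd.Series:
    """Label the non-missing embeddings of one group; NaN elsewhere."""
    mask = embeddings.notna()
    n_clusters = int(mask.sum()) // 2
    # Result keeps the group's original index so transform can realign it.
    labels = pd.Series(np.nan, index=embeddings.index, dtype='float64')
    if n_clusters > 1:
        X = np.array(embeddings[mask].tolist(), dtype=float)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        labels.loc[mask] = km.labels_
    return labels

toy = pd.DataFrame({
    'A': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y'],
    'D': [[0.0, 0.0], [0.1, 0.0], np.nan, [5.0, 5.0], [5.1, 5.0],
          [1.0, 1.0], np.nan, [2.0, 2.0]],
})
toy['clusters'] = toy.groupby('A')['D'].transform(cluster_group)
print(toy)
```

Group y has only two embeddings, so its cluster count is 1 and all of its rows stay NaN. The same rule explains the NaN values for group cc in the output above, which has three embeddings.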