Python Pandas groupby: getting the best zscore of counts() for each group


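I have a pandas groupby object that returns the count of each gene type, roughly as follows (column headers hand-formatted for clarity):

I need to get the zscore within each group and then return the gene with the highest zscore.

I tried the following, but it seems to compute the zscore over the whole dataset and does not return the correct zscores: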

zscore = lambda x: (x - x.mean()) / x.std()
counts = df.groupby(["ID", "Match"]).size().pipe(zscore)
I also tried transform and got the same result.
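
The whole-dataset behaviour is what `.pipe` would be expected to produce: it hands the entire counts Series to the function once, so a single mean and standard deviation are taken across all IDs. A minimal sketch of the difference, using made-up data (the frame `demo` and its values are assumptions for illustration, not the data from the question):

```python
import pandas as pd

# Hypothetical data: one row per hit, two IDs with two genes each
demo = pd.DataFrame({
    "ID": ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"],
    "Match": ["g1", "g2", "g2", "g2", "g1", "g1", "g1", "g1", "g1", "g2"],
})

zscore = lambda x: (x - x.mean()) / x.std()

# Counts per (ID, Match): a/g1=1, a/g2=3, b/g1=5, b/g2=1
counts = demo.groupby(["ID", "Match"]).size()

# .pipe passes the whole Series to zscore: one mean/std for all IDs
global_z = counts.pipe(zscore)

# Grouping the counts again by the ID level gives one mean/std per ID
per_id_z = counts.groupby(level="ID").transform(zscore)
```

With the second form, the z-scores inside each ID sum to zero, which is one quick way to check that the centering is per group.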

I also tried:

counts = match_df.groupby(["ID", "Match"]).size().apply(zscore)
which gives me the following error:

'int' object has no attribute 'mean'
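
This message is consistent with `Series.apply` calling the function once per element instead of once per Series, so `x` is a plain integer when the lambda runs. A small sketch that reproduces it on a hand-made Series (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical counts standing in for the result of .size()
counts = pd.Series([1, 12, 2], name="Counts")

# Series.apply calls the function on each value, so x is a scalar here
seen_types = counts.apply(lambda x: type(x).__name__).tolist()

# A scalar has no .mean(), so the zscore lambda fails on the first element
zscore = lambda x: (x - x.mean()) / x.std()
try:
    counts.apply(zscore)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```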
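No matter what I try, I don't get the correct output. The zscores for the first two rows should be [-1, 1], in which case I would return the row for 1_1_1 SMARCB1, and so on. Thanks.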
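
The expected values of [-1, 1] for a two-gene group correspond to the population standard deviation (`ddof=0`, the NumPy default). `Series.std()` in pandas defaults to the sample standard deviation (`ddof=1`), which gives roughly ±0.707 for any pair of distinct values. A short check, using the counts 1 and 12 purely as example values:

```python
import pandas as pd

# Hypothetical counts for one ID with two genes
pair = pd.Series([1, 12])

# Sample standard deviation (pandas default, ddof=1)
z_sample = (pair - pair.mean()) / pair.std()

# Population standard deviation (ddof=0), matching NumPy's default
z_population = (pair - pair.mean()) / pair.std(ddof=0)
```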

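Update: thanks to @ZaxR's help, and after switching to the numpy mean and standard deviation, I was able to solve this as shown below. The solution also produces a summary dataframe with the raw counts and zscores for each gene: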

import numpy as np
import pandas as pd

# group by id and gene match and count hits to each molecule
counts = df.groupby(["ID", "Match"]).size()

# calculate zscore by feature for molecule counts
# features that only align to one molecule are given a score of 1
zscore = lambda x: (x - np.mean(x)) / np.std(x)
# transform keeps the original (ID, Match) index on recent pandas versions;
# fillna(1) (a number, not the string '1') keeps the column numeric
zscores = counts.groupby('ID').transform(zscore).fillna(1).to_frame('Zscore')

# group results back together with counts and output to
# merge with positions and save to file
zscore_df = zscores.reset_index()
zscore_df.columns = ["ID", "Match", "Zscore"]
count_df = counts.reset_index()
count_df.columns = ["ID", "Match", "Counts"]
zscore_df["Counts"] = count_df["Counts"]

# select gene with best zscore in each ID
max_df = zscore_df[zscore_df.groupby('ID')['Zscore'].transform('max')
                   == zscore_df['Zscore']]
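
If exactly one row per ID is wanted, `idxmax` is a possible alternative to the boolean mask. Unlike the mask, it keeps only the first row when several genes tie for the top zscore. A sketch on a hand-built stand-in for `zscore_df` (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical summary frame with the same columns as zscore_df
zscore_df = pd.DataFrame({
    "ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
    "Match": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
    "Zscore": [-1.0, 1.0, -1.0, 1.0, 1.0],
    "Counts": [1, 12, 2, 17, 3],
})

# Index label of the highest Zscore within each ID
best_rows = zscore_df.groupby("ID")["Zscore"].idxmax()

# Pull those rows out of the summary frame
best_df = zscore_df.loc[best_rows].reset_index(drop=True)
```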
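df.groupby(["ID", "Gene"]).size().transform(zscore)
does not work because the last group is a Series with only one item, so when you try to apply the lambda function zscore to a single [integer], you get the
'int' object has no attribute 'mean'
error. Note that x.mean() behaves differently from 'mean'.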
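
One way to read the remark about `x.mean()` versus `'mean'`: when the string alias is passed to a grouped `transform`, pandas resolves it itself and broadcasts the group statistic back to every row, so no user function ever receives a scalar. A sketch under that reading, with hand-made counts (the data and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical counts indexed by (ID, Gene)
counts = pd.Series(
    [1, 12, 2, 17],
    index=pd.MultiIndex.from_tuples(
        [("1_1_1", "SMARCB1"), ("1_1_1", "smad"),
         ("1_1_10", "SMARCB1"), ("1_1_10", "smad")],
        names=["ID", "Gene"],
    ),
)

grouped = counts.groupby(level="ID")

# String aliases are resolved by pandas and broadcast back to every row
group_mean = grouped.transform("mean")
group_std = grouped.transform("std")  # sample std (ddof=1)

# Vectorised per-group z-score, no lambda needed
zscores = (counts - group_mean) / group_std
```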

Update: I think this should do it:

# Setup code
import pandas as pd

zscore = lambda x: (x - x.mean()) / x.std()

df = pd.DataFrame({"ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
                   "Gene": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
                   "Count": [1, 12, 2, 17, 3]})
df = df.set_index(['ID', 'Gene'])

# Add the zscore for every row (stored in the column named std_dev)
# Note: .transform keeps the original (ID, Gene) index, so the result
# aligns with df; .apply may add an extra index level on recent pandas
df['std_dev'] = df.groupby('ID')['Count'].transform(zscore)

# Find the max zscore for each group and
# use that as a mask for the original df
df[df.groupby('ID')['std_dev'].transform('max') == df['std_dev']]

Out:
                  Count   std_dev
ID       Gene
1_1_1    smad     12      0.707107
1_1_10   smad     17      0.707107

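Hmm, I'm away from my computer, but try
.groupby(['FeautreID','Match'],as_index=False).size().groupby(['FeatureID','Match']).apply(zscore)

Thanks, but I need to get the counts first in order to compute the zscore.

Yes, I just realized that. Please try my edit (after fixing any typos that may have crept in; I'm on my phone).

Thanks for the quick edit. I tried to get it working, but it just returns NaN.

Thank you, but as I mentioned above, transform does not give the correct answer. It seems to center on the population rather than on the group. I'm not sure exactly what it is doing, but the answer it returns is incorrect.

Sorry, I missed that when reading. I've added an explanation of why the transform approach doesn't work, for reference.

After some testing, my answer isn't ideal, because I lose the gene information. I've also been trying to work out how to change the zscore function so that it returns 1 instead of trying to compute a zscore when the group size is less than 2: zscore = lambda x: stats.zscore(x) if len(x) > 1 else 1, but that doesn't work either :/

Thanks for the update. The standard deviation step only returns NaN :(

Many thanks for your help, @ZaxR. You guided me to the solution, which I've posted above.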
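
For the idea raised in the comments of returning 1 when a group has fewer than two rows, one possible sketch is a named function that returns a Series carrying the group's own index in the small-group case, which lets a grouped `transform` align the result. The function name `zscore_or_one`, the sample counts, and the choice of the population standard deviation (`ddof=0`) are assumptions for illustration, not part of the original posts:

```python
import pandas as pd

def zscore_or_one(x: pd.Series) -> pd.Series:
    """Z-score within a group; groups with fewer than 2 rows get 1.0."""
    if len(x) < 2:
        # Return a Series on the group's index so transform can align it
        return pd.Series(1.0, index=x.index)
    # Population standard deviation (ddof=0), an assumption of this sketch
    return (x - x.mean()) / x.std(ddof=0)

# Hypothetical counts indexed by (ID, Gene)
counts = pd.Series(
    [1, 12, 2, 17, 3],
    index=pd.MultiIndex.from_tuples(
        [("1_1_1", "SMARCB1"), ("1_1_1", "smad"),
         ("1_1_10", "SMARCB1"), ("1_1_10", "smad"),
         ("1_1_100", "SMARCB1")],
        names=["ID", "Gene"],
    ),
)

zscores = counts.groupby(level="ID").transform(zscore_or_one)
```

A group with two or more identical counts would still produce NaN here (zero divided by zero), so a trailing `.fillna(1)` may still be wanted depending on the data.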