Python Pandas groupby: get the best zscore for each group's counts()

python pandas statistics

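I have a pandas groupby object that returns the count for each gene type, roughly as follows (column headers manually formatted for clarity):

I need to get the zscore within each group, and then return the gene with the highest zscore.

I tried the following, but it seems to compute the zscore over the entire dataset and does not return the correct zscore: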

zscore = lambda x: (x - x.mean()) / x.std()
counts = df.groupby(["ID", "Match"]).size().pipe(zscore)

I tried transform and got the same result.

I also tried:

counts = match_df.groupby(["ID", "Match"]).size().apply(zscore)

This gives me the following error:

'int' object has no attribute 'mean'

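No matter what I try, I don't get the correct output. The zscores for the first two rows should be [-1, 1], in which case I would return the row for 1_1_1 SMARCB1, and so on. Thanks.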

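Update: Thanks to @ZaxR's help, and by switching to numpy's mean and standard deviation, I was able to solve this as shown below. The solution also provides a summary dataframe with the raw counts and zscores for each gene: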

import numpy as np
import pandas as pd

# group by id and gene match and count hits to each molecule
counts = df.groupby(["ID", "Match"]).size()

# calculate zscore by feature for molecule counts
# np.std defaults to the population standard deviation (ddof=0)
# features that only align to one molecule are given a score of 1
zscore = lambda x: (x - np.mean(x)) / np.std(x)
# transform keeps the original (ID, Match) index; fill with the number 1, not the string '1'
zscores = counts.groupby('ID').transform(zscore).fillna(1).to_frame('Zscore')

# group results back together with counts and output to
# merge with positions and save to file
zscore_df = zscores.reset_index()
zscore_df.columns = ["ID", "Match", "Zscore"]
count_df = counts.reset_index()
count_df.columns = ["ID", "Match", "Counts"]
zscore_df["Counts"] = count_df["Counts"]

# select gene with best zscore meeting threshold
max_df = zscore_df[zscore_df.groupby('ID')['Zscore'].transform('max')
                   == zscore_df['Zscore']]

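The reason

df.groupby(["ID", "Gene"]).size().transform(zscore)

doesn't work is that the last group is a Series with only one item, so when you try to apply the lambda function zscore to a single [integer], you get the

'int' object has no attribute 'mean'

error. Note that x.mean() behaves differently from "mean".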
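One way to see where the error in the question comes from is the sketch below. It assumes made-up counts indexed by (ID, Gene); the values are illustrative, not the asker's data. `Series.apply` hands the function one element at a time, so `x` is a plain integer, while `.pipe` hands it the whole Series, so the zscore is computed over all rows at once rather than per group.

```python
import pandas as pd

zscore = lambda x: (x - x.mean()) / x.std()

# Illustrative counts only (assumption): one row per (ID, Gene) pair
counts = pd.Series(
    [1, 12, 2, 17, 3],
    index=pd.MultiIndex.from_tuples(
        [("1_1_1", "SMARCB1"), ("1_1_1", "smad"),
         ("1_1_10", "SMARCB1"), ("1_1_10", "smad"),
         ("1_1_100", "SMARCB1")],
        names=["ID", "Gene"],
    ),
)

# Series.apply calls zscore once per element, so x is a single int
try:
    counts.apply(zscore)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# .pipe passes the entire Series: this is a dataset-wide zscore, not a per-ID one
global_z = counts.pipe(zscore)
print(global_z)
```

Neither call knows about the ID groups, which is why a second groupby on ID is needed before computing the zscore.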

Update: I think this should do it:

# Setup code
import pandas as pd

# Same zscore function as in the question
zscore = lambda x: (x - x.mean()) / x.std()

df = pd.DataFrame({"ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
                   "Gene": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
                   "Count": [1, 12, 2, 17, 3]})
df = df.set_index(['ID', 'Gene'])

# Add the within-group zscore for every row (stored in the 'std_dev' column)
# Note: .apply(zscore) was used originally; .transform(zscore) keeps the
# original index, while apply may prepend the group key in recent pandas versions
df['std_dev'] = df.groupby('ID')['Count'].transform(zscore)

# Find the max zscore for each group and
# use that as a mask for the original df
df[df.groupby('ID')['std_dev'].transform('max') == df['std_dev']]

Out:
                  Count   std_dev
ID       Gene
1_1_1    smad     12      0.707107
1_1_10   smad     17      0.707107
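The 0.707107 values in this output, compared with the [-1, 1] expected in the question, can be explained by the standard deviation convention: pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `np.std` on an array defaults to the population standard deviation (`ddof=0`). A small sketch of the difference, using the two counts for ID 1_1_1 from the setup code:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 12])  # counts for ID 1_1_1 in the setup code

# pandas default: sample standard deviation (ddof=1)
sample_z = (x - x.mean()) / x.std()

# NumPy default on a plain array: population standard deviation (ddof=0)
population_z = (x - x.mean()) / np.std(x.to_numpy())

print(sample_z.round(6).tolist())  # [-0.707107, 0.707107]
print(population_z.tolist())       # [-1.0, 1.0]
```

With only two members in a group, the population version always yields -1 and 1, which matches the values the question expects.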

Hmm, I'm stepping away from my computer, but please try

.groupby(['FeatureID', 'Match'], as_index=False).size().groupby(['FeatureID', 'Match']).apply(zscore)

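Thanks, but I first need to get the counts to compute the zscore on.

Yes, I just realized that. Please try my edit (after fixing any typos that may have crept in; I'm on my phone).

Thanks for the quick edit. I tried to get it working, but it just returns NaN.

Thanks, but as I mentioned above, transform does not give the correct answer. It seems to center on the population rather than on the group. I'm not sure exactly what it is doing, but the answer it returns is incorrect.

Sorry, I missed that when reading. I've added an explanation of why the transform approach doesn't work, for reference.

After some testing, my answer isn't ideal because I lose the gene information.

I've also been trying to work out how to change the zscore function so that it returns 1 when the group size is less than 2 instead of trying to compute the zscore: zscore = lambda x: stats.zscore(x) if len(x) > 1 else 1, but that doesn't work either :/

Thanks for the update. The standard deviation step just returns NaN :(

Thank you very much for your help, @ZaxR. You guided me to the solution, which I've posted above.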
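For the single-member case discussed in these comments, a minimal sketch might look like the following. It assumes the convention from this thread that a group with only one member gets a score of 1, uses the population standard deviation so that two-member groups give [-1, 1], and reuses the illustrative counts from the answer's setup code. `group_zscore` is a hypothetical helper name, not a pandas or scipy function.

```python
import pandas as pd

def group_zscore(x: pd.Series) -> pd.Series:
    """Population zscore within one group; single-member groups get 1.0."""
    if len(x) < 2:
        # Convention assumed from the thread: a lone member scores 1
        return pd.Series(1.0, index=x.index)
    # Groups whose counts are all equal would give 0/0 = NaN; not handled here
    return (x - x.mean()) / x.std(ddof=0)

# Illustrative counts (assumption), indexed by (ID, Gene) so gene names are kept
counts = pd.Series(
    [1, 12, 2, 17, 3],
    index=pd.MultiIndex.from_tuples(
        [("1_1_1", "SMARCB1"), ("1_1_1", "smad"),
         ("1_1_10", "SMARCB1"), ("1_1_10", "smad"),
         ("1_1_100", "SMARCB1")],
        names=["ID", "Gene"],
    ),
)

# transform returns a result aligned with the original (ID, Gene) index
zscores = counts.groupby(level="ID").transform(group_zscore)

# Keep the row(s) whose zscore equals the maximum within their ID
best = zscores[zscores == zscores.groupby(level="ID").transform("max")]
print(best)
```

Because the result keeps the (ID, Gene) index, the gene information is not lost when selecting the best row per ID.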
