T-test on one DataFrame using groups from another DataFrame (Python, pandas, SciPy)

Goal:

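Perform a t-test on one DataFrame (df_rna) using groups found in another DataFrame (df_cnv). Then reduce the tested DataFrame (df_rna) to the row indices with the highest t-test scores.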
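The second half of this goal, keeping only the rows with the strongest t-test scores, is not worked out elsewhere in this thread. The following is a minimal sketch, assuming a results table shaped like the df_report printed in this thread (columns p_val and t_stat, indexed by gene). The helper name top_genes and its parameters k and finite_only are hypothetical, and skipping infinite statistics is an assumption rather than something the original poster asked for.

```python
import numpy as np
import pandas as pd

# Values copied from the df_report table printed in this thread.
df_report = pd.DataFrame(
    {"p_val": [0.966863, 1.0, 0.141358, 0.0],
     "t_stat": [0.0450988, 0.0, -1.98508, np.inf]},
    index=pd.Index(["x", "y", "z", "n"], name="gene"),
)


def top_genes(report, k, finite_only=True):
    """Return the k gene labels with the largest absolute t statistic."""
    scores = report["t_stat"].astype(float).abs()
    if finite_only:
        # Assumption: an infinite statistic (zero variance in both groups)
        # is treated as degenerate and skipped instead of being ranked first.
        scores = scores[np.isfinite(scores)]
    return scores.nlargest(k).index


selected = top_genes(df_report, k=2)
print(list(selected))
```

With the selected labels in hand, df_rna.loc[selected] would keep only those rows of the expression table. Ranking by p_val instead (for example with nsmallest) would be an equally reasonable choice.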

Code sample:

import pandas as pd

# DataFrame (df_cnv) whose values split the columns (cells) into a True group
# and a False group for the t-test.
cnv = {'gene': ['x','y','z','n'],
        'cell_a': [0,-1,0,-1],
        'cell_b': [0,-1,-1,-1],
        'cell_c': [-1,0,-1,0],
        'cell_d': [-1,0,-1,0],
        'cell_e': [-1,0,0,0]
       }
df_cnv = pd.DataFrame(cnv)
df_cnv.set_index('gene', inplace=True)
cnv_mask = df_cnv < 0
cnv_mask  # True values are negative (gene loss is True)
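To make the grouping concrete, here is a small sketch of what the mask does for a single gene: the True cells form one sample and the False cells form the other. The df_rna values below are hypothetical placeholders chosen only for illustration; they are not the expression data from the original question, so the numbers they produce will not match the results reported in this thread.

```python
import pandas as pd
from scipy import stats

# The mask from the post: True where the copy-number value is negative.
df_cnv = pd.DataFrame(
    {"gene": ["x", "y", "z", "n"],
     "cell_a": [0, -1, 0, -1],
     "cell_b": [0, -1, -1, -1],
     "cell_c": [-1, 0, -1, 0],
     "cell_d": [-1, 0, -1, 0],
     "cell_e": [-1, 0, 0, 0]}
).set_index("gene")
cnv_mask = df_cnv < 0

# Hypothetical expression values, NOT the df_rna from the original post.
df_rna = pd.DataFrame(
    {"cell_a": [1.0, 5.0, 2.0, 9.0],
     "cell_b": [2.0, 6.5, 7.0, 8.0],
     "cell_c": [4.0, 1.0, 6.0, 3.0],
     "cell_d": [3.5, 2.0, 8.0, 2.5],
     "cell_e": [5.0, 1.5, 3.0, 4.0]},
    index=df_cnv.index,
)

gene = "x"
in_loss = cnv_mask.loc[gene]               # boolean Series indexed by cell
loss_group = df_rna.loc[gene, in_loss]     # cells where the gene is lost
other_group = df_rna.loc[gene, ~in_loss]   # all remaining cells

# Two-sample t-test between the two groups of cells for this one gene.
result = stats.ttest_ind(loss_group, other_group)
print(list(loss_group.index), list(other_group.index))
print(result.statistic, result.pvalue)
```

Repeating this for every gene is exactly what the answers in this thread do, either by transposing first or by iterating over rows.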

First, I would transpose both DataFrames and set up a new DataFrame for the t-test results:

import pandas as pd
import scipy.stats

cnv_mask_t = cnv_mask.transpose()
df_rna_t = df_rna.transpose()
df_tres = pd.DataFrame(index=df_rna.index, columns=['pval', 'stat'])

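Then you can iterate over the genes, which are now columns, and use the mask to separate the values where it is True from the values where it is False: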

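for gene in df_rna_t:
    col_mask = cnv_mask_t[gene]  # True for the cells in which this gene is lost
    # Compare the True group against the False group (~ negates the mask)
    tres = scipy.stats.ttest_ind(df_rna_t[gene][col_mask], df_rna_t[gene][~col_mask])
    df_tres.loc[gene] = [tres.pvalue, tres.statistic]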

I think you can take it from here.
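The loop above relies on ~col_mask to pick the second group. As a quick illustration of what that operator does on a boolean pandas Series, here is a tiny sketch; the Series contents are hypothetical and only mirror the shape of one gene's mask.

```python
import pandas as pd

cells = ["cell_a", "cell_b", "cell_c", "cell_d", "cell_e"]
col_mask = pd.Series([False, False, True, True, True], index=cells)
values = pd.Series([1.0, 2.0, 4.0, 3.5, 5.0], index=cells)  # hypothetical values

inverted = ~col_mask                 # element-wise logical NOT
print(inverted.tolist())             # [True, True, False, False, False]
print(values[col_mask].tolist())     # [4.0, 3.5, 5.0]
print(values[~col_mask].tolist())    # [1.0, 2.0]
```

Note that this only applies to boolean arrays and Series. On a plain Python bool, ~ is a bitwise operation on the underlying integer, and the keyword not cannot be applied to a whole Series.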

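P.S. I'm on my phone at the moment, so I can't test the code. If you need more help, let me know and I'll look into it once I get to a computer.

Very clean and very clear. Transposing is the only issue, because I'm working with multiple DataFrames of roughly 3000 columns by 20000 rows. Otherwise it is easy to follow, but what does the tilde do in [~col_mask]?

It negates the boolean vector, which lets you select all cells where the mask is False.

@Thomas Matthew, if you want to get rid of the transposition, see my answer.

This solution hangs and throws a warning:

C:\Users\test\Anaconda3\lib\site-packages\numpy\core\_methods.py:82: RuntimeWarning: Degrees of freedom

I didn't notice that you had posted a traceback. Could you check how long it takes to apply ~cnv_mask to rnadf_all, that is, rnadf_all[~cnv_mask]? For large DataFrames my solution is not optimal, because this is done on every iteration of the loop. It is better to do it once before the loop and cache the result: not_rnadf_all = rnadf_all[~cnv_mask]. Then, inside the loop, replace rnadf_all[~cnv_mask].loc[r[0]].dropna() with not_rnadf_all.loc[r[0]].dropna().

I submitted both versions (optimal and suboptimal) to our school's compute cluster 15 hours ago. The optimal one finished in 00:02:01 of CPU time (the results look reasonable) with a max vmem of 5.174G. The less optimal solution still seems to be running...

The suboptimal solution finished with 15:09:03 of CPU time and a max vmem of 5.112G. I will edit your solution to reflect the optimal version. Thanks again.

I should have thought of that when I saw your comment about 20000 rows in the DF. For the 5x4 matrix of sample data it isn't a problem, but if filtering the whole DF with cnv_mask takes 4,5 seconds, then with 20000 rows we spend 15 hours doing it separately for every row inside the loop. Stupid mistake.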
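The reason the cached version is so much faster is that indexing a DataFrame with a boolean DataFrame masks the whole table at once: it returns a frame of the same shape with NaN wherever the mask is False. Doing that inside the loop repeats the full-table work for every row. The sketch below illustrates this on a tiny example; the frames are hypothetical stand-ins that reuse the names rnadf_all and cnv_mask from the comments.

```python
import pandas as pd

genes = pd.Index(["x", "y"], name="gene")

# Hypothetical stand-ins for the real (much larger) tables.
rnadf_all = pd.DataFrame(
    {"cell_a": [1.0, 5.0], "cell_b": [2.0, 6.5], "cell_c": [4.0, 1.0]},
    index=genes,
)
cnv_mask = pd.DataFrame(
    {"cell_a": [False, True], "cell_b": [False, True], "cell_c": [True, False]},
    index=genes,
)

# Masking the full frame: same shape, NaN wherever ~cnv_mask is False.
# Computed once, outside the loop.
not_rnadf_all = rnadf_all[~cnv_mask]
print(not_rnadf_all)

for gene in rnadf_all.index:
    cached = not_rnadf_all.loc[gene].dropna()
    # The slow variant masks the entire frame again on every iteration.
    recomputed = rnadf_all[~cnv_mask].loc[gene].dropna()
    assert cached.equals(recomputed)
```

Both variants select the same values for each row, so hoisting the mask out of the loop changes only the amount of work done, not the result.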
import pandas as pd
from scipy import stats

# Create an empty DataFrame for the t-test results
df_report = pd.DataFrame(index=df_rna.index, columns=['p_val', 't_stat'])

# Apply the negated mask once, outside the loop, and cache the result
not_df_rna = df_rna[~cnv_mask]

# Iterate through the masked df_rna rows, drop NaN values, run ttest_ind and save the result to df_report
for r in df_rna[cnv_mask].iterrows():
    df_report.at[r[0], 't_stat'], df_report.at[r[0], 'p_val'] = stats.ttest_ind(r[1].dropna(), not_df_rna.loc[r[0]].dropna())

Result:

df_report

         p_val     t_stat
gene                     
x     0.966863  0.0450988
y            1          0
z     0.141358   -1.98508
n            0        inf
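For tables as large as the ones mentioned in the comments (roughly 3000 columns by 20000 rows), one further option that nobody in this thread proposed is to drop the Python-level loop and compute the pooled-variance t statistic for all rows at once with NumPy. The following is a sketch under stated assumptions: the function name rowwise_ttest is hypothetical, the expression values are hypothetical placeholders, the two frames are assumed to share the same index and columns, and the expression table is assumed to contain no missing values.

```python
import numpy as np
import pandas as pd
from scipy import stats


def rowwise_ttest(df_values, mask):
    """Student's t-test per row: cells where mask is True vs. cells where it is False."""
    values = df_values.to_numpy(dtype=float)
    # Align the mask to the value table before converting to a plain array.
    in_group = mask.loc[df_values.index, df_values.columns].to_numpy(dtype=bool)

    a = np.where(in_group, values, np.nan)    # True group, NaN elsewhere
    b = np.where(~in_group, values, np.nan)   # False group, NaN elsewhere

    n_a = in_group.sum(axis=1)
    n_b = (~in_group).sum(axis=1)
    mean_a, mean_b = np.nanmean(a, axis=1), np.nanmean(b, axis=1)
    var_a = np.nanvar(a, axis=1, ddof=1)
    var_b = np.nanvar(b, axis=1, ddof=1)

    # Pooled variance, matching the default equal_var=True of ttest_ind.
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    t_stat = (mean_a - mean_b) / np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    p_val = 2.0 * stats.t.sf(np.abs(t_stat), dof)  # two-sided p-value
    return pd.DataFrame({"p_val": p_val, "t_stat": t_stat}, index=df_values.index)


genes = pd.Index(["x", "y", "z", "n"], name="gene")
cnv_mask = pd.DataFrame(
    {"cell_a": [0, -1, 0, -1], "cell_b": [0, -1, -1, -1],
     "cell_c": [-1, 0, -1, 0], "cell_d": [-1, 0, -1, 0],
     "cell_e": [-1, 0, 0, 0]},
    index=genes,
) < 0

# Hypothetical expression values, NOT the df_rna from the original post.
df_rna = pd.DataFrame(
    {"cell_a": [1.0, 5.0, 2.0, 9.0], "cell_b": [2.0, 6.5, 7.0, 8.0],
     "cell_c": [4.0, 1.0, 6.0, 3.0], "cell_d": [3.5, 2.0, 8.0, 2.5],
     "cell_e": [5.0, 1.5, 3.0, 4.0]},
    index=genes,
)

report = rowwise_ttest(df_rna, cnv_mask)
print(report)
```

Rows in which either group has fewer than two cells, or in which both groups have zero variance, would yield NaN or infinite statistics here, much like the inf in the table above, so they may need separate handling.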