Tukey test grouping and plotting in NumPy/SciPy


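I am trying to plot the results of a Tukey test, but I am struggling to sort the data into groups based on the p-values. I am trying to replicate this. I have been using the SciPy one-way ANOVA test and the Tukey test from statsmodels, but I cannot get the groups to come out the same way.

Any help is greatly appreciated.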

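I also just found this.

I have been struggling to do the same thing. I found a paper that describes how to assign the letters:
Hans-Peter Piepho (2004), "An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons", Journal of Computational and Graphical Statistics, 13:2, 456-466, DOI: 10.1198/1061860043515

Writing the code is a bit tricky, because you need to check and duplicate columns and then merge columns. I tried to add some comments to the code. I worked out a way to run tukeyhsd and then compute the letters from its result. It should be possible to turn this into a function, or hopefully it could become part of tukeyhsd. My data is not posted, but it is one column of data followed by one column describing the group. My groups are the five boroughs of New York City. You can also change the comments in the first section and use random data instead.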

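import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.multicomp import MultiComparison

# Read data.  Comment out the next line and uncomment the following ones to use random data.
df=pd.read_excel('anova_test.xlsx')
#n=1000
#df = pd.DataFrame(columns=['Groups','Data'],index=np.arange(n))
#df['Groups']=np.random.randint(1, 4,size=n)
#df['Data']=df['Groups']*np.random.random_sample(size=n)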


# define columns for data and then grouping
col_to_group='Groups'
col_for_data='Data'

# Now take the data and regroup it for the ANOVA
samples = [cols[1] for cols in df.groupby(col_to_group)[col_for_data]]    # groupby yields (group name, Series) pairs; cols[1] is the Series of values for each group
f_val, p_val = stats.f_oneway(*samples)  # the star unpacks the list so that each group's values are passed as a separate argument
#print('F value: {:.3f}, p value: {:.3f}\n'.format(f_val, p_val))

# This if statement can be uncommented if you don't want to go further without p<0.05
#if p_val<0.05:    # If the p value is less than 0.05 it then does the Tukey test
mod = MultiComparison(df[col_for_data], df[col_to_group])
thsd=mod.tukeyhsd()
#print(mod.tukeyhsd())

# This implements the Piepho method: an algorithm for a letter-based representation of all-pairwise comparisons.
tot=len(thsd.groupsunique)
# Make an empty dataframe that is a square matrix with one row and one column per group, and set the first column to 1
df_ltr=pd.DataFrame(np.nan, index=np.arange(tot),columns=np.arange(tot))
df_ltr.iloc[:,0]=1
count=0
df_nms = pd.DataFrame('', index=np.arange(tot), columns=['names'])  # A dummy dataframe to put the axis labels into.  sd stands for significant difference

for i in np.arange(tot):   # Loop through and make all pairwise comparisons.
    for j in np.arange(i+1,tot):
        #print('i=',i,'j=',j,thsd.reject[count])
        if thsd.reject[count]==True:
            for cn in np.arange(tot):
                if df_ltr.iloc[i,cn]==1 and df_ltr.iloc[j,cn]==1: # If the column contains both i and j, shift and duplicate
                    df_ltr=pd.concat([df_ltr.iloc[:,:cn+1],df_ltr.iloc[:,cn+1:].T.shift().T],axis=1)
                    df_ltr.iloc[:,cn+1]=df_ltr.iloc[:,cn]
                    df_ltr.iloc[i,cn]=0
                    df_ltr.iloc[j,cn+1]=0
                # Now we need to check all columns for absorption.
                for cleft in np.arange(len(df_ltr.columns)-1):
                    for cright in np.arange(cleft+1,len(df_ltr.columns)):
                        if (df_ltr.iloc[:,cleft].isna()).all()==False and (df_ltr.iloc[:,cright].isna()).all()==False: 
                            if (df_ltr.iloc[:,cleft]>=df_ltr.iloc[:,cright]).all()==True:  
                                df_ltr.iloc[:,cright]=0
                                df_ltr=pd.concat([df_ltr.iloc[:,:cright],df_ltr.iloc[:,cright:].T.shift(-1).T],axis=1)
                            if (df_ltr.iloc[:,cleft]<=df_ltr.iloc[:,cright]).all()==True:
                                df_ltr.iloc[:,cleft]=0
                                df_ltr=pd.concat([df_ltr.iloc[:,:cleft],df_ltr.iloc[:,cleft:].T.shift(-1).T],axis=1)

        count+=1

#I sort so that the first column becomes A        
df_ltr=df_ltr.sort_values(by=list(df_ltr.columns),axis=1,ascending=False)

# I assign letters to each column
for cn in np.arange(len(df_ltr.columns)):
    df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(1,chr(97+cn)) 
    df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(0,'')
    df_ltr.iloc[:,cn]=df_ltr.iloc[:,cn].replace(np.nan,'') 

#I put all the letters into one string
df_ltr=df_ltr.astype(str)
df_ltr.sum(axis=1)
#print(df_ltr)
#print('\n')
#print(df_ltr.sum(axis=1))

# Now plot like R, with a box plot and the individual points scattered on top
fig,ax=plt.subplots()
df.boxplot(column=col_for_data, by=col_to_group,ax=ax,fontsize=16,showmeans=True
                    ,boxprops=dict(linewidth=2.0),whiskerprops=dict(linewidth=2.0))  #This makes the boxplot

ax.set_ylim([-10,20])

grps=pd.unique(df[col_to_group].values)   #Finds the group names
grps.sort() # This is critical!  Puts the groups in alphabetical order to make them match the plotting order

props=dict(facecolor='white',alpha=1)
for i,grp in enumerate(grps):   #I loop through the groups to make the scatters and figure out the axis labels. 

    x = np.random.normal(i+1, 0.15, size=len(df[df[col_to_group]==grp][col_for_data]))
    ax.scatter(x,df[df[col_to_group]==grp][col_for_data],alpha=0.5,s=2)
    name="{}\navg={:0.2f}\n(n={})".format(grp
                            ,df[df[col_to_group]==grp][col_for_data].mean()
                            ,df[df[col_to_group]==grp][col_for_data].count())
    df_nms.loc[i,'names']=name  # use .loc to avoid chained assignment, which may not update df_nms
    ax.text(i+1,ax.get_ylim()[1]*1.1,df_ltr.sum(axis=1)[i],fontsize=10,verticalalignment='top',horizontalalignment='center',bbox=props)


ax.set_xticklabels(df_nms['names'],rotation=0,fontsize=10)
ax.set_title('')
fig.suptitle('')

fig.savefig('anovatest.jpg',dpi=600,bbox_inches='tight')
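
The text before the code suggests the letter assignment could be turned into a function. The following is a minimal sketch of that idea, not the code above repackaged: it uses Python sets instead of a DataFrame, and the name `letter_groups` and its arguments are hypothetical. It follows the same two steps described in the answer: split every column that contains both members of a significantly different pair, then absorb columns that are contained in another column.

```python
def letter_groups(names, different_pairs):
    """Assign letters so that two groups share a letter only if they
    were NOT found to be significantly different.

    names           -- list of group labels (hypothetical argument)
    different_pairs -- iterable of (a, b) label pairs that differ significantly
    """
    # Start with a single column (letter) shared by every group.
    columns = [set(names)]
    for a, b in different_pairs:
        # Split step: a column holding both a and b is replaced by
        # two copies, one without a and one without b.
        split = []
        for col in columns:
            if a in col and b in col:
                split.append(col - {a})
                split.append(col - {b})
            else:
                split.append(col)
        # Absorb step: drop any column that is contained in another one.
        kept = []
        for col in split:
            if any(col <= other for other in kept):
                continue
            kept = [other for other in kept if not other <= col]
            kept.append(col)
        columns = kept
    # Order the columns so that the first group receives the letter 'a'.
    columns.sort(key=lambda col: sorted(names.index(n) for n in col))
    letters = {n: '' for n in names}
    for k, col in enumerate(columns):
        for n in col:
            letters[n] += chr(97 + k)
    return letters


# Example: only A and C differ, so B shares a letter with both.
print(letter_groups(['A', 'B', 'C'], [('A', 'C')]))
```

With a statsmodels result, the pairs could presumably be built by pairing `itertools.combinations(thsd.groupsunique, 2)` with `thsd.reject`, which is the ordering the loop above relies on. That ordering is an assumption taken from the answer's code and worth checking against the printed Tukey table.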

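Is there a general rule for how to assign "mixed" membership, like the a,b in the second link? (A long time ago I tried to do some grouping in the statsmodels sandbox, but then gave up because there is no partition into these groups. A is not significantly different from B, and B is not significantly different from C, but A and C are significantly different. So there is no transitivity.)

The R package has a reference for the letter assignment: Piepho, Hans-Peter (2004), "An Algorithm for a Letter-Based Representation of All-Pairwise Comparisons".

I was inspired by BrianM and made my own version, using the same paper as the source of the method. My approach can also be used with scikit_posthocs and includes a Monte Carlo optimization step to remove more unnecessary letters, described in the paper as "sweeping". It is written in Python.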
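
To illustrate the transitivity point raised in the comment above, here is a small sketch that lists the triples where "not significantly different" fails to be transitive. The function name and inputs are hypothetical and are not part of statsmodels or scikit_posthocs. Any such triple is a case where the middle group has to carry two letters.

```python
from itertools import combinations


def non_transitive_triples(names, different_pairs):
    """Return triples (x, m, y) where x~m and m~y but x and y differ.

    '~' means "not significantly different"; different_pairs lists the
    label pairs that ARE significantly different.
    """
    differ = {frozenset(pair) for pair in different_pairs}

    def same(x, y):
        return frozenset((x, y)) not in differ

    triples = []
    for a, b, c in combinations(names, 3):
        # Try each of the three groups as the middle element.
        for x, m, y in ((a, b, c), (a, c, b), (b, a, c)):
            if same(x, m) and same(m, y) and not same(x, y):
                triples.append((x, m, y))
    return triples


# A~B and B~C, but A and C differ: B sits between two groups.
print(non_transitive_triples(['A', 'B', 'C'], [('A', 'C')]))
```

An empty result means the groups can be partitioned cleanly and each group needs only one letter.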