Python: list of NaN values when calculating p-values and Z-scores with SciPy


I am calculating Z-scores and p-values for different sub-segments of a DataFrame.

The DataFrame has two columns. Here are its first 5 rows:

df[["Engagement_score", "Performance"]].head()
   Engagement_score  Performance
0    6                 0.0
1    5                 0.0
2    7                 66.3
3    3                 0.0
4    11                0.0
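Here is the distribution of Engagement_score:

Here is the distribution of Performance:

I group the DataFrame by engagement score and then compute three statistics for each group:

1) The average performance score (sub_average) and the number of values in the group (sub_bookings)

2) The average performance score of all the other groups (rest_average) and the number of values in the other groups (rest_bookings)

3) The overall average performance score (overall_average) and the overall number of values (overall_bookings), computed over the whole DataFrame

Here is my code: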

def stats_comparison(i):
    df.groupby(i)['Performance'].agg({
    'average': 'mean',
    'bookings': 'count'
    }).reset_index()
    cat = df.groupby(i)['Performance']\
        .agg({
            'sub_average': 'mean',
            'sub_bookings': 'count'
       }).reset_index()
    cat['overall_average'] = df['Performance'].mean()
    cat['overall_bookings'] = df['Performance'].count()
    cat['rest_bookings'] = cat['overall_bookings'] - cat['sub_bookings']
    cat['rest_average'] = (cat['overall_bookings']*cat['overall_average'] \
                     - cat['sub_bookings']*cat['sub_average'])/cat['rest_bookings']
    cat['z_score'] = (cat['sub_average']-cat['rest_average'])/\
        np.sqrt(cat['overall_average']*(1-cat['overall_average'])
            *(1/cat['sub_bookings']+1/cat['rest_bookings'])) 
    cat['prob'] = np.around(stats.norm.cdf(cat.z_score), decimals = 10) # this is the p value
    cat['significant'] = [(lambda x: 1 if x > 0.9 else -1 if x < 0.1 else 0)(i) for i in cat['prob']] 
    # if the p value is less than 0.1 then I can confidently say that the 2 samples are different. 
    print(cat)

stats_comparison('Engagement_score')
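I don't know why I get a list of NaN values in the Z-score and p-value columns. There are no negative values in my dataset.

When I run the code in a Jupyter notebook, I also get the following warnings: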

C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
  after removing the cwd from sys.path.
C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version


    C:\Users\User\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: RuntimeWarning: invalid value encountered in sqrt
      from ipykernel import kernelapp as app
    C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
      return (self.a < x) & (x < self.b)
    C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
      return (self.a < x) & (x < self.b)
    C:\Users\User\Anaconda3\lib\site-packages\scipy\stats\_distn_infrastructure.py:1738: RuntimeWarning: invalid value encountered in greater_equal
      cond2 = (x >= self.b) & cond0
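The warnings above point at two separate things, which the following minimal sketch tries to reproduce. The toy DataFrame is hypothetical and only mimics the shape of the data in the question. The FutureWarning goes away with named aggregation (available in pandas 0.25 and later), and the RuntimeWarning from sqrt arises because the expression p * (1 - p) is non-negative only when p lies between 0 and 1. That formula is the variance term for a proportion, so it presumably does not fit a Performance column with values such as 66.3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data shaped like the DataFrame in the question.
df = pd.DataFrame({
    "Engagement_score": [6, 5, 7, 3, 11, 6, 5, 7],
    "Performance": [0.0, 0.0, 66.3, 0.0, 0.0, 12.5, 0.0, 40.0],
})

# Named aggregation: the keyword becomes the output column name,
# replacing the deprecated dict-on-Series form of .agg().
cat = (
    df.groupby("Engagement_score")["Performance"]
      .agg(sub_average="mean", sub_bookings="count")
      .reset_index()
)

overall_average = df["Performance"].mean()

# p * (1 - p) is negative as soon as p > 1, so its square root is NaN.
variance_term = overall_average * (1 - overall_average)
with np.errstate(invalid="ignore"):  # silence the sqrt warning for the demo
    root = np.sqrt(variance_term)

print(cat)
print("overall average:", overall_average)
print("term under the root:", variance_term)
print("square root:", root)
```

Once the square root is NaN, every later column derived from it (the Z-score, the probability from stats.norm.cdf) is NaN as well, which would explain the remaining warnings from scipy.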

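np.sqrt(cat['overall_average']*(1-cat['overall_average'])...: given the size of the overall average, you always have a negative number under your square root.

@ALollz So how should I modify my code?

@ALollz I have included an image showing the distributions of the two variables.

You can't use a Z-test here. The problem is that your subsample is correlated with the whole sample, so you violate the independence assumption of the Z-test. You should use a two-sample t-test to test for a difference between the subsample and everything outside the subsample. The scipy function should do the whole calculation; you just need to split the DataFrame into two groups.

@ALollz When I run the t-test, I get the same t-value and probability value for all groups. Here is how I calculate the t-value: cat['t_value'] = stats.ttest_ind(cat['sub_average'], cat['rest_average'])[0]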
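The call in the last comment passes the whole sub_average and rest_average columns to stats.ttest_ind, so it compares the group means with each other once and assigns that single result to every row. A rough sketch of the suggestion in the comments, splitting the raw rows into "this group" and "everything else" for each engagement score, might look like the following. The helper name ttest_by_group, the toy data, and the choice of equal_var=False (Welch's t-test) are assumptions made for illustration, not something stated in the thread.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_group(df, group_col, value_col):
    """Two-sample t-test of each group against all rows outside it.

    Hypothetical helper: the raw observations are split per group,
    rather than testing on already-aggregated means.
    """
    rows = []
    for level, sub in df.groupby(group_col)[value_col]:
        sub = sub.dropna()
        rest = df.loc[df[group_col] != level, value_col].dropna()
        # equal_var=False selects Welch's t-test (an assumption here).
        t_value, p_value = stats.ttest_ind(sub, rest, equal_var=False)
        rows.append({
            group_col: level,
            "sub_average": sub.mean(),
            "sub_bookings": len(sub),
            "rest_average": rest.mean(),
            "rest_bookings": len(rest),
            "t_value": t_value,
            "p_value": p_value,
        })
    return pd.DataFrame(rows)


# Hypothetical data standing in for the DataFrame in the question.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Engagement_score": rng.integers(3, 8, size=200),
    "Performance": rng.exponential(scale=30.0, size=200),
})

result = ttest_by_group(demo, "Engagement_score", "Performance")
print(result)
```

Each row now carries its own t-value and two-sided p-value. A group with a single observation yields NaN under Welch's test, because its sample variance is undefined.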