Python: counting prior occurrences for use in a chi-squared test

Tags: python, python-2.7, csv, chi-squared

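I am using a script to count the number of times a person appears in the list before the date given on that row with a 1 in column 6, and also the number of times that person (column 7) appears in the list before that date at all. Note that the rows are sorted chronologically, and that column references are zero-based.

Sample dataset, the code I am using, and what it returns:

Ultimately, I want to run a chi-squared test on the percentage data I generate. For now, though, what I would like to do is calculate and sum the fractional probability of any one person within a unique data class (column 2), and append it to the csv as a new column. I am not sure whether the code I am using can be edited as a whole to achieve this. Any constructive suggestions or comments on how best to do this would be greatly appreciated.

My desired output is as follows: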
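This is not meant as a complete answer to your question (what you are trying to do is a little ambiguous); it is only meant to show how naturally pandas fits this kind of calculation. You can also refer to columns by name rather than by index.

Suppose you have a test.csv file as follows: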

date,x0,cls,x1,x2,x3,tag,name
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
...
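I have given each column a name. You can read the file into a data frame with

import pandas as pd
# DataFrame.from_csv was removed in pandas 1.0; read_csv is its replacement
df = pd.read_csv( 'test.csv' )
and df will look like: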

          date    x0         cls  x1     x2  x3  tag          name
0   02/01/2005  Data   Class xpv   4  11yo+   4    1  George Smith
1   02/01/2005  Data   Class xpv   4  11yo+   4    2     Ted James
2   02/01/2005  Data   Class xpv   4  11yo+   4    3    Emma Lilly
3   02/01/2005  Data   Class xpv   4  11yo+   4    5  George Smith
...
I drop the columns you are not using (this is only for the demonstration; you do not have to drop them).
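The code for this step is not shown here. A minimal sketch of one way to do it, assuming the unused columns are x0, x1, x2 and x3 (which is what the table that follows suggests) and reading the four sample rows from an inline string instead of test.csv:

```python
import io

import pandas as pd

# The first four rows of test.csv, inlined so the sketch is self-contained.
SAMPLE = """date,x0,cls,x1,x2,x3,tag,name
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Assumption: x0..x3 are the columns that are not needed for the counts.
df = df.drop(columns=["x0", "x1", "x2", "x3"])
print(df)
```

Selecting the wanted columns instead, with df[['date', 'cls', 'tag', 'name']], would have the same effect.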

Now, df looks like:

          date         cls  tag          name
0   02/01/2005   Class xpv    1  George Smith
1   02/01/2005   Class xpv    2     Ted James
2   02/01/2005   Class xpv    3    Emma Lilly
3   02/01/2005   Class xpv    5  George Smith
...
Suppose you want to find the cumulative number of times each person appeared on the dates before each day:

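# pivot_table's old rows=/cols= keywords are now index=/columns=
pv = df.pivot_table( columns='name',
                     index='date',
                     values='tag',
                     aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
The pandas API documentation describes each of these methods in detail. The result is: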

date        Emma Lilly  George Smith  Ted James  Tom Phillips
02/01/2005           0             0          0             0
03/01/2005           1             2          1             1
04/01/2005           2             4          1             3
05/01/2005           2             7          2             5
Alternatively, you can use groupby:

df.groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
To do the same computation, but only for rows where tag == 1, you can do the following:

idx = df.tag == 1
pv1 = df[ idx ].pivot_table( columns='name',
                             index='date',
                             values='tag',
                             aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
Or, with the groupby syntax:

df[ df.tag == 1 ].groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
which gives:

date        Emma Lilly  George Smith  Ted James
02/01/2005           0             0          0
03/01/2005           0             1          0
04/01/2005           1             1          0
05/01/2005           1             2          0
To fill the two new columns, we write a helper function that falls back to 0 when a value is missing:

def lookup( pivot_table, col, idx, fall_back=0 ):
    try:
        return pivot_table[ col ][ idx ]
    except KeyError:
        return fall_back

df[ 'cnt1' ] = [ lookup( pv1, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
df[ 'cnt' ] = [ lookup( pv, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
We get:

          date         cls  tag          name  cnt1  cnt
0   02/01/2005   Class xpv    1  George Smith     0    0
1   02/01/2005   Class xpv    2     Ted James     0    0
2   02/01/2005   Class xpv    3    Emma Lilly     0    0
3   02/01/2005   Class xpv    5  George Smith     0    0
4   02/01/2005   Class tn2    4  Tom Phillips     0    0
5   03/01/2005   Class tn2    2  Tom Phillips     0    1
6   03/01/2005   Class tn2    5  George Smith     1    2
7   03/01/2005   Class tn2    3  Tom Phillips     0    1
8   03/01/2005   Class tn2    1    Emma Lilly     0    1
9   03/01/2005   Class tn2    6  George Smith     1    2
10  04/01/2005   Class tn2    6     Ted James     0    1
11  04/01/2005   Class tn2    3  Tom Phillips     0    3
12  04/01/2005   Class tn2    2  George Smith     1    4
13  04/01/2005   Class tn2    4  George Smith     1    4
14  04/01/2005   Class tn2    1  George Smith     1    4
15  04/01/2005   Class tn2    5  Tom Phillips     0    3
16  05/01/2005  Class 22zn    3    Emma Lilly     1    2
17  05/01/2005  Class 22zn    1     Ted James     0    2
18  05/01/2005  Class 22zn    2  George Smith     2    7
19  05/01/2005  Class 22zn    4    Emma Lilly     1    2
20  05/01/2005  Class 22zn    5  Tom Phillips     0    5
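For comparison, the same two counts can be computed without pandas in a single pass over the rows. This is only a sketch: the function name prior_counts and the inline SAMPLE string are illustrative assumptions, and it relies on the rows already being sorted by date, as the question states. Counters are updated only after a whole date has been processed, so appearances on the same day are not counted.

```python
import csv
import io
from collections import Counter
from itertools import groupby

# The 21 rows of the table above, inlined so the sketch is self-contained.
SAMPLE = """date,cls,tag,name
02/01/2005,Class xpv,1,George Smith
02/01/2005,Class xpv,2,Ted James
02/01/2005,Class xpv,3,Emma Lilly
02/01/2005,Class xpv,5,George Smith
02/01/2005,Class tn2,4,Tom Phillips
03/01/2005,Class tn2,2,Tom Phillips
03/01/2005,Class tn2,5,George Smith
03/01/2005,Class tn2,3,Tom Phillips
03/01/2005,Class tn2,1,Emma Lilly
03/01/2005,Class tn2,6,George Smith
04/01/2005,Class tn2,6,Ted James
04/01/2005,Class tn2,3,Tom Phillips
04/01/2005,Class tn2,2,George Smith
04/01/2005,Class tn2,4,George Smith
04/01/2005,Class tn2,1,George Smith
04/01/2005,Class tn2,5,Tom Phillips
05/01/2005,Class 22zn,3,Emma Lilly
05/01/2005,Class 22zn,1,Ted James
05/01/2005,Class 22zn,2,George Smith
05/01/2005,Class 22zn,4,Emma Lilly
05/01/2005,Class 22zn,5,Tom Phillips
"""


def prior_counts(rows):
    """Return (name, date, cnt1, cnt) per row, counting earlier dates only."""
    seen = Counter()       # appearances on earlier dates
    seen_tag1 = Counter()  # appearances with tag == 1 on earlier dates
    out = []
    # groupby works here because the rows are already sorted by date.
    for _, same_day in groupby(rows, key=lambda r: r["date"]):
        same_day = list(same_day)
        for r in same_day:
            out.append((r["name"], r["date"], seen_tag1[r["name"]], seen[r["name"]]))
        # Only now fold this date into the running totals.
        for r in same_day:
            seen[r["name"]] += 1
            if r["tag"] == "1":
                seen_tag1[r["name"]] += 1
    return out


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
result = prior_counts(rows)
print(result[18])  # ('George Smith', '05/01/2005', 2, 7)
```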
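I could continue if I knew how you calculate the last column. For example, why does "Tom Phillips" get 0.2 on the sixth row?

Edit: OK, let's continue. We need to know how many times each person appeared on each date; this is another pivot table: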

appr = df.pivot_table( columns='name',
                       index='date',
                       values='tag',
                       aggfunc=len ).fillna( 0 )

Output:

date        Emma Lilly  George Smith  Ted James  Tom Phillips
02/01/2005           1             2          1             1
03/01/2005           1             2          0             2
04/01/2005           0             3          1             2
05/01/2005           2             1          1             1
and the total number of appearances on each date:

total_appr = appr.sum( axis=1 )
Output:

date
02/01/2005    5
03/01/2005    5
04/01/2005    6
05/01/2005    5
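To compute the cumulative fraction, you just divide each row by its total, shift by one row (because we are looking at prior dates), and then take the cumulative sum: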

frac = appr.apply( lambda x: x / total_appr ).shift( 1 ).fillna( 0 ).cumsum( )
df[ 'frac' ] = [ frac[ row[ 'name' ] ][ row[ 'date' ] ] for idx, row in df.iterrows( ) ]
Now, df looks like:

          date         cls  tag          name  cnt1  cnt      frac
0   02/01/2005   Class xpv    1  George Smith     0    0  0.000000
1   02/01/2005   Class xpv    2     Ted James     0    0  0.000000
2   02/01/2005   Class xpv    3    Emma Lilly     0    0  0.000000
3   02/01/2005   Class xpv    5  George Smith     0    0  0.000000
4   02/01/2005   Class tn2    4  Tom Phillips     0    0  0.000000
5   03/01/2005   Class tn2    2  Tom Phillips     0    1  0.200000
6   03/01/2005   Class tn2    5  George Smith     1    2  0.400000
7   03/01/2005   Class tn2    3  Tom Phillips     0    1  0.200000
8   03/01/2005   Class tn2    1    Emma Lilly     0    1  0.200000
9   03/01/2005   Class tn2    6  George Smith     1    2  0.400000
10  04/01/2005   Class tn2    6     Ted James     0    1  0.200000
11  04/01/2005   Class tn2    3  Tom Phillips     0    3  0.600000
12  04/01/2005   Class tn2    2  George Smith     1    4  0.800000
13  04/01/2005   Class tn2    4  George Smith     1    4  0.800000
14  04/01/2005   Class tn2    1  George Smith     1    4  0.800000
15  04/01/2005   Class tn2    5  Tom Phillips     0    3  0.600000
16  05/01/2005  Class 22zn    3    Emma Lilly     1    2  0.400000
17  05/01/2005  Class 22zn    1     Ted James     0    2  0.366667
18  05/01/2005  Class 22zn    2  George Smith     2    7  1.300000
19  05/01/2005  Class 22zn    4    Emma Lilly     1    2  0.400000
20  05/01/2005  Class 22zn    5  Tom Phillips     0    5  0.933333
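As a hand check of one frac value, take George Smith on 05/01/2005 (row 18). A small sketch using exact fractions, with the per-date counts and totals copied from the appr and total_appr outputs; the variable names are illustrative:

```python
from fractions import Fraction

# (George Smith's appearances, total appearances) on each earlier date,
# read off appr and total_appr for 02/01, 03/01 and 04/01.
earlier = [(2, 5), (2, 5), (3, 6)]

frac_george = sum(Fraction(n, total) for n, total in earlier)
print(frac_george, float(frac_george))  # 13/10 1.3
```

This agrees with the 1.300000 shown in row 18.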

In two rows of the last column my numbers differ from yours, so either I have miscalculated or you have miscalculated those two numbers.
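The row-by-row iterrows lookups above can also be written as a merge. The following is a sketch rather than the method used above: it swaps in pd.crosstab for the counting pivot table and uses melt plus merge to attach the values, and the helper names prior_cumsum and to_long are made up for illustration. It also assumes, as the code above does, that the date strings happen to sort chronologically; for real data it would be safer to parse them first, for example with pd.to_datetime(..., dayfirst=True).

```python
import io

import pandas as pd

# The 21 rows of the table above, inlined so the sketch is self-contained.
CSV = """date,cls,tag,name
02/01/2005,Class xpv,1,George Smith
02/01/2005,Class xpv,2,Ted James
02/01/2005,Class xpv,3,Emma Lilly
02/01/2005,Class xpv,5,George Smith
02/01/2005,Class tn2,4,Tom Phillips
03/01/2005,Class tn2,2,Tom Phillips
03/01/2005,Class tn2,5,George Smith
03/01/2005,Class tn2,3,Tom Phillips
03/01/2005,Class tn2,1,Emma Lilly
03/01/2005,Class tn2,6,George Smith
04/01/2005,Class tn2,6,Ted James
04/01/2005,Class tn2,3,Tom Phillips
04/01/2005,Class tn2,2,George Smith
04/01/2005,Class tn2,4,George Smith
04/01/2005,Class tn2,1,George Smith
04/01/2005,Class tn2,5,Tom Phillips
05/01/2005,Class 22zn,3,Emma Lilly
05/01/2005,Class 22zn,1,Ted James
05/01/2005,Class 22zn,2,George Smith
05/01/2005,Class 22zn,4,Emma Lilly
05/01/2005,Class 22zn,5,Tom Phillips
"""

df = pd.read_csv(io.StringIO(CSV))

# Appearances per person per date, with 0 where a person is absent.
appr = pd.crosstab(df["date"], df["name"])


def prior_cumsum(table):
    """Cumulative sum over strictly earlier dates (rows)."""
    return table.shift(1).fillna(0).cumsum()


def to_long(table, value_name):
    """Turn a date-by-name table into one row per (date, name) pair."""
    return table.reset_index().melt(id_vars="date", var_name="name",
                                    value_name=value_name)


cnt = prior_cumsum(appr)
frac = prior_cumsum(appr.div(appr.sum(axis=1), axis=0))

# A left merge keeps the original row order of df.
out = (df.merge(to_long(cnt, "cnt"), on=["date", "name"], how="left")
         .merge(to_long(frac, "frac"), on=["date", "name"], how="left"))
print(out.tail())
```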

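This should be straightforward, except that it is not clear what you mean by "the fractional probability of any one person within a unique data class". For example, for data class xpv, your data starts with 5 rows, in which George Smith appears twice. What "fractional probability" do you expect to see for George Smith? What do you expect for the others, who appear once? And why does the sample output show only zeros next to the xpv rows?

The answer may depend on whether a data class can recur on later dates, and on whether that matters for your calculation; but if you can explain how the first 5 values are calculated, the rest may become clear. (If not, please explain the second group, where the values do become non-zero.)

Note: perhaps this was already addressed in the discussion in t...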