Python: counting prior occurrences for use in a chi-squared test
Tags: python, python-2.7, csv, chi-squared

I am using a script to count how many times a person appears in the list, with a 1 in column 6, before the date given on that row, and also how many times that person (column 7) appears in the list at all before the date given on that row. (Note that the rows are sorted chronologically and that column references are zero-based.)

Sample dataset

The code I am using

This returns:

Ultimately I want to run a chi-squared test on the percentage data I generate. For now, however, what I want is to calculate and sum the fractional probability of any one person within a unique data class (column 2) and append it to the CSV as a new column. I am not sure whether the code I am using can be edited as a whole to achieve this. Any constructive suggestions or comments on how best to do this would be much appreciated.

My expected output is as follows:
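This is not meant to be a complete answer to your question (what you are trying to do is somewhat ambiguous), only to show how naturally pandas fits this kind of computation; it also lets you refer to columns by name rather than by index.
Suppose you have a test.csv file like this: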
date,x0,cls,x1,x2,x3,tag,name
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
...
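I have given each column a name. You can load the file with:
import pandas as pd
df = pd.read_csv( 'test.csv' )  # read_csv replaces DataFrame.from_csv, which newer pandas removed
and df will look like this: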
date x0 cls x1 x2 x3 tag name
0 02/01/2005 Data Class xpv 4 11yo+ 4 1 George Smith
1 02/01/2005 Data Class xpv 4 11yo+ 4 2 Ted James
2 02/01/2005 Data Class xpv 4 11yo+ 4 3 Emma Lilly
3 02/01/2005 Data Class xpv 4 11yo+ 4 5 George Smith
...
I drop the columns you are not using (purely for demonstration; you do not have to drop them).
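The code for this step did not survive on the page. A minimal sketch of one way to do it, assuming the unused columns are the ones named x0, x1, x2 and x3 above, and reading the four sample rows from an inline string instead of test.csv so the snippet runs on its own:

```python
import io
import pandas as pd

# The four sample rows shown above, inlined so the sketch is self-contained.
CSV = """date,x0,cls,x1,x2,x3,tag,name
02/01/2005,Data,Class xpv,4,11yo+,4,1,George Smith
02/01/2005,Data,Class xpv,4,11yo+,4,2,Ted James
02/01/2005,Data,Class xpv,4,11yo+,4,3,Emma Lilly
02/01/2005,Data,Class xpv,4,11yo+,4,5,George Smith
"""

df = pd.read_csv(io.StringIO(CSV))

# Assumption: x0..x3 are the columns not needed for the counts.
df = df.drop(columns=['x0', 'x1', 'x2', 'x3'])
print(df)
```

Selecting the wanted columns instead, with df[['date', 'cls', 'tag', 'name']], would have the same effect.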
Now df looks like this:
date cls tag name
0 02/01/2005 Class xpv 1 George Smith
1 02/01/2005 Class xpv 2 Ted James
2 02/01/2005 Class xpv 3 Emma Lilly
3 02/01/2005 Class xpv 5 George Smith
...
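Suppose you want the cumulative number of times each person appeared on the dates before each day:
# index= and columns= are the current names of the older rows= and cols= keywords
pv = df.pivot_table( index='date',
                     columns='name',
                     values='tag',
                     aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
The API documentation describes each of these methods in detail. The result is: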
date Emma Lilly George Smith Ted James Tom Phillips
02/01/2005 0 0 0 0
03/01/2005 1 2 1 1
04/01/2005 2 4 1 3
05/01/2005 2 7 2 5
Alternatively, you can use groupby:
df.groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
To do the same computation, but only for rows where tag == 1, you can do the following:
idx = df.tag == 1
pv1 = df[ idx ].pivot_table( index='date',
                             columns='name',
                             values='tag',
                             aggfunc=len ).shift( 1 ).fillna( 0 ).cumsum( )
Or, with the groupby syntax:
df[ df.tag == 1 ].groupby(['date', 'name'])['name'].aggregate(len).unstack( ).shift( 1 ).fillna( 0 ).cumsum( )
This gives:
date Emma Lilly George Smith Ted James
02/01/2005 0 0 0
03/01/2005 0 1 0
04/01/2005 1 1 0
05/01/2005 1 2 0
To fill in the two new columns, we write a helper function that falls back to 0 when a value is missing:
def lookup( pivot_table, col, idx, fall_back=0 ):
    try:
        return pivot_table[ col ][ idx ]
    except KeyError:
        return fall_back
df[ 'cnt1' ] = [ lookup( pv1, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
df[ 'cnt' ] = [ lookup( pv, row[ 'name' ], row[ 'date' ] ) for idx, row in df.iterrows( ) ]
We get:
date cls tag name cnt1 cnt
0 02/01/2005 Class xpv 1 George Smith 0 0
1 02/01/2005 Class xpv 2 Ted James 0 0
2 02/01/2005 Class xpv 3 Emma Lilly 0 0
3 02/01/2005 Class xpv 5 George Smith 0 0
4 02/01/2005 Class tn2 4 Tom Phillips 0 0
5 03/01/2005 Class tn2 2 Tom Phillips 0 1
6 03/01/2005 Class tn2 5 George Smith 1 2
7 03/01/2005 Class tn2 3 Tom Phillips 0 1
8 03/01/2005 Class tn2 1 Emma Lilly 0 1
9 03/01/2005 Class tn2 6 George Smith 1 2
10 04/01/2005 Class tn2 6 Ted James 0 1
11 04/01/2005 Class tn2 3 Tom Phillips 0 3
12 04/01/2005 Class tn2 2 George Smith 1 4
13 04/01/2005 Class tn2 4 George Smith 1 4
14 04/01/2005 Class tn2 1 George Smith 1 4
15 04/01/2005 Class tn2 5 Tom Phillips 0 3
16 05/01/2005 Class 22zn 3 Emma Lilly 1 2
17 05/01/2005 Class 22zn 1 Ted James 0 2
18 05/01/2005 Class 22zn 2 George Smith 2 7
19 05/01/2005 Class 22zn 4 Emma Lilly 1 2
20 05/01/2005 Class 22zn 5 Tom Phillips 0 5
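For readers who want to reproduce the table above end to end, here is a self-contained sketch. It rebuilds the 21 rows from that table and reindexes the per-day counts over every date and name, so no fallback helper is needed. The names prior_counts, dates and names are illustrative and not from the original answer, and the sketch assumes the date strings sort chronologically, which holds for this sample but would not hold across months.

```python
import io
import pandas as pd

# Rows copied from the table above (date, cls, tag, name).
CSV = """date,cls,tag,name
02/01/2005,Class xpv,1,George Smith
02/01/2005,Class xpv,2,Ted James
02/01/2005,Class xpv,3,Emma Lilly
02/01/2005,Class xpv,5,George Smith
02/01/2005,Class tn2,4,Tom Phillips
03/01/2005,Class tn2,2,Tom Phillips
03/01/2005,Class tn2,5,George Smith
03/01/2005,Class tn2,3,Tom Phillips
03/01/2005,Class tn2,1,Emma Lilly
03/01/2005,Class tn2,6,George Smith
04/01/2005,Class tn2,6,Ted James
04/01/2005,Class tn2,3,Tom Phillips
04/01/2005,Class tn2,2,George Smith
04/01/2005,Class tn2,4,George Smith
04/01/2005,Class tn2,1,George Smith
04/01/2005,Class tn2,5,Tom Phillips
05/01/2005,Class 22zn,3,Emma Lilly
05/01/2005,Class 22zn,1,Ted James
05/01/2005,Class 22zn,2,George Smith
05/01/2005,Class 22zn,4,Emma Lilly
05/01/2005,Class 22zn,5,Tom Phillips
"""

df = pd.read_csv(io.StringIO(CSV))

# Assumption: the date strings sort chronologically (true for this sample).
dates = sorted(df['date'].unique())
names = sorted(df['name'].unique())


def prior_counts(frame):
    """Appearances of each person on all dates strictly before each date."""
    per_day = (frame.groupby(['date', 'name']).size()
                    .unstack(fill_value=0)
                    # Make sure every date and every person is present.
                    .reindex(index=dates, columns=names, fill_value=0))
    return per_day.shift(1).fillna(0).cumsum().astype(int)


pv = prior_counts(df)                   # all appearances
pv1 = prior_counts(df[df['tag'] == 1])  # appearances with tag == 1 only

pairs = list(zip(df['date'], df['name']))
df['cnt1'] = [pv1.at[d, n] for d, n in pairs]
df['cnt'] = [pv.at[d, n] for d, n in pairs]
print(df)
```

Reindexing trades the try/except lookup for a slightly larger table; on data with many people and dates the fallback helper may be the lighter choice.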
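I could continue if I knew how you computed the last column. For example, why does "Tom Phillips" get 0.2 in the sixth row?
Edit: OK, let's continue. We need to know how many times each person appeared on each date; that is another pivot table:
appr = df.pivot_table( index='date',
                       columns='name',
                       values='tag',
                       aggfunc=len ).fillna( 0 )
Output: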
date Emma Lilly George Smith Ted James Tom Phillips
02/01/2005 1 2 1 1
03/01/2005 1 2 0 2
04/01/2005 0 3 1 2
05/01/2005 2 1 1 1
and the total number of appearances on each date:
total_appr = appr.sum( axis=1 )
Output:
date
02/01/2005 5
03/01/2005 5
04/01/2005 6
05/01/2005 5
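To compute the cumulative fraction, divide each row by its total, shift by one (since we are looking at previous dates), and then take the cumulative sum: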
frac = appr.apply( lambda x: x / total_appr ).shift( 1 ).fillna( 0 ).cumsum( )
df[ 'frac' ] = [ frac[ row[ 'name' ] ][ row[ 'date' ] ] for idx, row in df.iterrows( ) ]
Now df looks like this:
date cls tag name cnt1 cnt frac
0 02/01/2005 Class xpv 1 George Smith 0 0 0.000000
1 02/01/2005 Class xpv 2 Ted James 0 0 0.000000
2 02/01/2005 Class xpv 3 Emma Lilly 0 0 0.000000
3 02/01/2005 Class xpv 5 George Smith 0 0 0.000000
4 02/01/2005 Class tn2 4 Tom Phillips 0 0 0.000000
5 03/01/2005 Class tn2 2 Tom Phillips 0 1 0.200000
6 03/01/2005 Class tn2 5 George Smith 1 2 0.400000
7 03/01/2005 Class tn2 3 Tom Phillips 0 1 0.200000
8 03/01/2005 Class tn2 1 Emma Lilly 0 1 0.200000
9 03/01/2005 Class tn2 6 George Smith 1 2 0.400000
10 04/01/2005 Class tn2 6 Ted James 0 1 0.200000
11 04/01/2005 Class tn2 3 Tom Phillips 0 3 0.600000
12 04/01/2005 Class tn2 2 George Smith 1 4 0.800000
13 04/01/2005 Class tn2 4 George Smith 1 4 0.800000
14 04/01/2005 Class tn2 1 George Smith 1 4 0.800000
15 04/01/2005 Class tn2 5 Tom Phillips 0 3 0.600000
16 05/01/2005 Class 22zn 3 Emma Lilly 1 2 0.400000
17 05/01/2005 Class 22zn 1 Ted James 0 2 0.366667
18 05/01/2005 Class 22zn 2 George Smith 2 7 1.300000
19 05/01/2005 Class 22zn 4 Emma Lilly 1 2 0.400000
20 05/01/2005 Class 22zn 5 Tom Phillips 0 5 0.933333
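As an independent cross-check of the frac column, here is a minimal pure-Python sketch of the same definition: for each row, sum over all earlier dates the person's share of that day's appearances. The names ROWS and prior_fractions are illustrative and not from the original answer. Exact fractions are used so the values can be compared without rounding, and the sketch again assumes that the date strings sort chronologically.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (date, name) pairs copied from the table above, in row order.
ROWS = [
    ("02/01/2005", "George Smith"), ("02/01/2005", "Ted James"),
    ("02/01/2005", "Emma Lilly"), ("02/01/2005", "George Smith"),
    ("02/01/2005", "Tom Phillips"),
    ("03/01/2005", "Tom Phillips"), ("03/01/2005", "George Smith"),
    ("03/01/2005", "Tom Phillips"), ("03/01/2005", "Emma Lilly"),
    ("03/01/2005", "George Smith"),
    ("04/01/2005", "Ted James"), ("04/01/2005", "Tom Phillips"),
    ("04/01/2005", "George Smith"), ("04/01/2005", "George Smith"),
    ("04/01/2005", "George Smith"), ("04/01/2005", "Tom Phillips"),
    ("05/01/2005", "Emma Lilly"), ("05/01/2005", "Ted James"),
    ("05/01/2005", "George Smith"), ("05/01/2005", "Emma Lilly"),
    ("05/01/2005", "Tom Phillips"),
]


def prior_fractions(rows):
    """Return one Fraction per row: the sum, over all earlier dates, of
    (the person's appearances that day) / (all appearances that day)."""
    per_day = defaultdict(Counter)
    for date, name in rows:
        per_day[date][name] += 1

    running = defaultdict(Fraction)  # name -> cumulative share so far
    before = {}                      # date -> totals as of the day before
    for date in sorted(per_day):     # assumption: string order is chronological
        before[date] = dict(running)
        total = sum(per_day[date].values())
        for name, n in per_day[date].items():
            running[name] += Fraction(n, total)
    return [before[date].get(name, Fraction(0)) for date, name in rows]


frac = prior_fractions(ROWS)
print([round(float(f), 6) for f in frac])
```

With the sample rows, George Smith's value on 05/01/2005 works out to 2/5 + 2/5 + 3/6 = 13/10, which agrees with the 1.300000 shown in the table.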
In two rows of the last column my numbers differ from yours, so either I or you miscalculated those two values.
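
This should be straightforward, but it is unclear what you mean by "the fractional probability of any one person within a unique data class". For example, for the data class xpv, your data begins with 5 rows in which George Smith appears twice. What "fractional probability" do you expect to see for George Smith? What about the others, who appear once? And why does your sample output show only zeros next to the xpv rows?
The answer may depend on whether a data class recurs on later dates and whether that matters for your calculation; but if you can explain how the first 5 values are calculated, the rest may become clear. (If not, please explain the second group, where the values do become nonzero.)
Note: perhaps this was resolved in the discussion in t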