Python: How to compute the Pearson correlation for selected columns of a data frame

python pandas

I have a CSV that looks like this:

gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3

As a data frame, it looks like this:

In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
  gene  stem1  stem2  stem3  b1  b2  b3  special_col
0  foo     20     10     11  23  22  79            3
1  bar     17     13    505  12  13  88            1
2  qui     17     13      5  12  13  88            3

What I want to do is compute the Pearson correlation between the last column (special_col) and every column between the gene column and the special column, i.e. colnames[1:number_of_column-1].

At the end of the day, we will have a data frame of length 6:

Coln   PearCorr
stem1  0.5
stem2 -0.5
stem3 -0.9999453506011533
b1    0.5
b2    0.5
b3    -0.5

The values above were computed manually, for example for stem3:

In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)
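For reference, the stem3 figure can be reproduced from the textbook definition of Pearson's r (covariance divided by the product of the standard deviations). This is only an illustrative sketch in NumPy; the variable names are my own and are not part of the question.

```python
import numpy as np

# special_col and stem3 from the three sample rows
x = np.array([3, 1, 3], dtype=float)
y = np.array([11, 505, 5], dtype=float)

# Centre both vectors, then divide the summed cross-products
# by the product of the root sums of squares.
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r)
```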

How can I do that?

You can apply a lambda over your column range that calls corr and passes the special_col Series:
In [126]:
df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))

Out[126]:
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
dtype: float64
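A self-contained version of the same idea, in case the dpaste URL is no longer reachable. The CSV text is inlined, and the final step reshapes the result into the two-column Coln / PearCorr layout asked for in the question. The names `csv_text`, `target`, `features` and `result` are illustrative choices, not anything from the original answer.

```python
import io
import pandas as pd

# Inline copy of the sample CSV from the question
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

target = df["special_col"]
features = df[df.columns[1:-1]]  # everything between 'gene' and 'special_col'

# Series.corr defaults to Pearson; apply runs it once per column.
pear = features.apply(lambda col: col.corr(target))

# Turn the Series into a two-column data frame.
result = pear.rename_axis("Coln").reset_index(name="PearCorr")
print(result)
```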

Timings:
Actually, the other method is faster, so I'd expect it to scale better:
In [130]:
%timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
%timeit df[df.columns[1:]].corr()['special_col']

1000 loops, best of 3: 1.75 ms per loop
1000 loops, best of 3: 836 µs per loop

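Note that there is a mistake in your data: the special column is all 3s, so no correlation can be computed.
If you drop the column selection at the end, you get the correlation matrix of all the other columns you are analysing. The trailing [:-1] removes the correlation of special_col with itself.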
In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]: 
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64

If you are interested in speed, this is slightly faster on my machine:
In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]: 
array([ 0.5       , -0.5       , -0.99994535,  0.5       ,  0.5       ,
       -0.5       ])

In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 µs per loop

In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 µs per loop

But obviously it returns an array rather than a pandas Series/DataFrame.
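If the labels matter, the array can be wrapped back into a Series afterwards. A minimal sketch, assuming the same sample data (inlined here) and using my own variable names; whether the speed advantage survives the extra step would need to be measured.

```python
import io
import numpy as np
import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))

numeric = data[data.columns[1:]]  # drop the string column 'gene'

# np.corrcoef treats each row as a variable, hence the transpose.
# The last row holds special_col; [:-1] drops its self-correlation.
r = np.corrcoef(numeric.to_numpy().T)[-1][:-1]

# Re-attach the column labels.
as_series = pd.Series(r, index=numeric.columns[:-1], name="special_col")
print(as_series)
```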
Why not just do the following:
In [34]: df.corr().iloc[:-1,-1]
Out[34]:
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64

Or select the special_col row by label rather than by position (the .ix variant timed below).

Timings:
In [35]: %timeit df.corr().iloc[-1,:-1]
1000 loops, best of 3: 576 us per loop

In [40]: %timeit df.corr().ix['special_col', :-1]
1000 loops, best of 3: 634 us per loop

In [36]: %timeit df[df.columns[1:]].corr()['special_col']
1000 loops, best of 3: 968 us per loop

In [37]: %timeit df[df.columns[1:-1]].apply(lambda x: x.corr(df['special_col']))
100 loops, best of 3: 2.12 ms per loop
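The one-liners timed above date from an older pandas. Here is a sketch of how they might look on a current release, assuming two changes: `.ix` was removed in pandas 1.0 in favour of `.loc` and `.iloc`, and from pandas 2.0 `df.corr()` no longer silently drops non-numeric columns such as `gene`, so the numeric columns are selected explicitly. The inline CSV stands in for the dpaste URL.

```python
import io
import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only numeric columns so corr() never sees the string column 'gene'.
corr = df.select_dtypes(include="number").corr()

# Positional selection: last row, all columns except the last.
by_position = corr.iloc[-1, :-1]

# Label-based selection, replacing the old .ix['special_col', :-1].
by_label = corr.loc["special_col"].iloc[:-1]

print(by_label)
```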

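pd.DataFrame.corrwith() can be used instead of df.corr().
Pass in the column that you want to correlate with the remaining columns.
For the specific example above, the code would be:
df.corrwith(df['special_col'])
Or simply use df.corr()['special_col'], which computes the full correlation of every column with every other column, and then subset what you need.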
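A slightly fuller sketch of the corrwith approach on the sample data. It drops gene and special_col before the call, so the result has no self-correlation entry and no string column is involved; that pre-filtering is my own addition rather than part of the answer above.

```python
import io
import pandas as pd

csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Columns to correlate: everything except the label and the target.
features = df.drop(columns=["gene", "special_col"])

# corrwith computes the correlation of each column with the given Series.
pear = features.corrwith(df["special_col"])
print(pear)
```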
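Sorry, are you asking for the Pearson correlation between special_col and a single column, or between special_col and all the columns in colnames?

@EdChum: special_col and every column in between. See my update, thanks.

There are 40K rows and over 200 columns. Is there any way to speed this up?

This iterates column-wise. I don't know whether applying corr to the whole df first and then selecting 'special_col' is faster than doing it only on the columns of interest.

You should accept the other answer; it is faster, and I'd expect it to perform better on your real data.

This is faster than my method, +1.

Thanks for the quick timings! It's faster on my machine too!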
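On the 40K rows by 200+ columns question raised in the comments: one option nobody timed in this thread is to compute only the correlations against the target column with vectorised NumPy, instead of the full column-by-column matrix. The helper `corr_with_column` below is hypothetical, assumes there are no missing values, and runs on a scaled-down random stand-in for the real data. Whether it is actually faster should be measured with %timeit on your own frame.

```python
import numpy as np
import pandas as pd

def corr_with_column(frame, target):
    """Pearson r of every other numeric column with `target` (no NaN handling)."""
    numeric = frame.select_dtypes(include="number")
    features = numeric.drop(columns=target)
    X = features.to_numpy(dtype=float)
    y = numeric[target].to_numpy(dtype=float)

    # Centre the data, then compute all cross-products in one pass.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return pd.Series(r, index=features.columns, name=target)

# Scaled-down random stand-in for the real 40K x 200 data
rng = np.random.default_rng(0)
cols = [f"c{i}" for i in range(20)] + ["special_col"]
big = pd.DataFrame(rng.normal(size=(1000, 21)), columns=cols)

fast = corr_with_column(big, "special_col")

# Cross-check against the full correlation matrix.
reference = big.corr()["special_col"].drop("special_col")
print(np.allclose(fast, reference))
```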