Python 3.x pandas: Rare data form disrupting otherwise normal statistical analysis
I've run into a problem while analysing some bioinformatics data (in pandas), where a rare but valid form of data disrupts the statistical analysis of said data. This is what the data (in the DataFrame grouped_PrEST) usually looks like:
PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR1 CD38 5298 10158 NaN
HPRR2 EGFR 79749 85793 117274
HPRR6 EPS8 68076 62305 66599
HPRR6 EPS8 NaN NaN 141828
Below is some code that handles this data (PrEST_stats is another DataFrame where the statistics are collected). What it essentially does is:
- The data for the gene CD38 above will give two medians (R1 and R2) that are identical to their origins in pepCN1 and pepCN2 (since there is only a single row for the gene CD38)
- The gene EPS8 will give R1 and R2 in a similar fashion, but will assign R3 another value based on the two rows of the pepCN3 column

The problem arises with rare (but valid) data that looks like this:

PrEST ID Gene pepCN1 pepCN2 pepCN3
HPRR9 PTK2B 4972 NaN NaN
HPRR9 PTK2B 17095 NaN NaN
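To see how the per-group medians behave on both of these data forms, here is a minimal sketch using the rows from the tables above (the column names are taken from the question; everything else is illustrative):

```python
import pandas as pd
import numpy as np

# The rows from the two tables above.
df = pd.DataFrame({
    'PrEST ID': ['HPRR1', 'HPRR2', 'HPRR6', 'HPRR6', 'HPRR9', 'HPRR9'],
    'Gene':     ['CD38', 'EGFR', 'EPS8', 'EPS8', 'PTK2B', 'PTK2B'],
    'pepCN1':   [5298, 79749, 68076, np.nan, 4972, 17095],
    'pepCN2':   [10158, 85793, 62305, np.nan, np.nan, np.nan],
    'pepCN3':   [np.nan, 117274, 66599, 141828, np.nan, np.nan],
})

medians = df.groupby(['PrEST ID', 'Gene']).median()
print(medians)
# CD38's single row passes through unchanged, EPS8's two pepCN3 rows are
# combined into one median, and PTK2B's two pepCN1 values collapse to a
# single median -- losing the spread that the statistics below would need.
```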
indexer = PrESTs.groupby('PrEST ID').median().count(1) == 1
one_replicate = PrESTs.loc[PrESTs['PrEST ID'].isin(indexer[indexer].index)]
multiple_replicates = PrESTs.loc[~PrESTs['PrEST ID'].isin(indexer[indexer]
                                                          .index)]
all_replicates = {0: one_replicate, 1: multiple_replicates}

# Calculations (PrESTs)
PrEST_stats_1 = pd.DataFrame()
PrEST_stats_2 = pd.DataFrame()
all_stats = {0: PrEST_stats_1, 1: PrEST_stats_2}
for n in range(2):
    current_replicate = all_replicates[n].groupby(['PrEST ID', 'Gene names'])
    current_stats = all_stats[n]
    if n == 1:
        current_stats['R1'] = current_replicate['pepCN1'].median()
        current_stats['R2'] = current_replicate['pepCN2'].median()
        current_stats['R3'] = current_replicate['pepCN3'].median()
    else:
        current_stats['R1'] = current_replicate['pepCN1']  # PROBLEM (not with .median())
        current_stats['R2'] = current_replicate['pepCN2']  # PROBLEM (not with .median())
        current_stats['R3'] = current_replicate['pepCN3']  # PROBLEM (not with .median())
    current_stats['CN'] = current_stats[['R1', 'R2', 'R3']].median(axis=1)
    current_stats['CN'] = current_stats['CN'].round()
    current_stats['STD'] = current_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
    current_stats['CV'] = current_stats['STD'] / \
        current_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
    current_stats['STD'] = current_stats['STD'].round()
    current_stats['CV'] = current_stats['CV'].round(1)
PrEST_stats = PrEST_stats_1.append(PrEST_stats_2)
Here, the script will reduce the two pepCN1 values to a single median, without regard for the fact that there are then no values left for calculating statistics across the other data columns (i.e. the data in replicates 2 and 3). The script will run and give a correct CN value (the median), but the statistics for standard deviation and coefficient of variation are left out (i.e. they come out as NaN).

In cases like this I'd want the script to somehow see that reducing the data column to a single value (the first median) is not the way to go. Basically, I want it to skip calculating the first median (here: R1) and instead calculate the statistics on the two rows of pepCN1. Is there a way to do this? Thanks in advance.
[EDIT: New problem]
OK, so the code now looks like this:
indexer = PrESTs.groupby('PrEST ID').median().count(1) == 1
one_replicate = PrESTs.loc[PrESTs['PrEST ID'].isin(indexer[indexer].index)]
multiple_replicates = PrESTs.loc[~PrESTs['PrEST ID'].isin(indexer[indexer]
                                                          .index)]
all_replicates = {0: one_replicate, 1: multiple_replicates}

# Calculations (PrESTs)
PrEST_stats_1 = pd.DataFrame()
PrEST_stats_2 = pd.DataFrame()
all_stats = {0: PrEST_stats_1, 1: PrEST_stats_2}
for n in range(2):
    current_replicate = all_replicates[n].groupby(['PrEST ID', 'Gene names'])
    current_stats = all_stats[n]
    if n == 1:
        current_stats['R1'] = current_replicate['pepCN1'].median()
        current_stats['R2'] = current_replicate['pepCN2'].median()
        current_stats['R3'] = current_replicate['pepCN3'].median()
    else:
        current_stats['R1'] = current_replicate['pepCN1']  # PROBLEM (not with .median())
        current_stats['R2'] = current_replicate['pepCN2']  # PROBLEM (not with .median())
        current_stats['R3'] = current_replicate['pepCN3']  # PROBLEM (not with .median())
    current_stats['CN'] = current_stats[['R1', 'R2', 'R3']].median(axis=1)
    current_stats['CN'] = current_stats['CN'].round()
    current_stats['STD'] = current_stats[['R1', 'R2', 'R3']].std(axis=1, ddof=1)
    current_stats['CV'] = current_stats['STD'] / \
        current_stats[['R1', 'R2', 'R3']].mean(axis=1) * 100
    current_stats['STD'] = current_stats['STD'].round()
    current_stats['CV'] = current_stats['CV'].round(1)
PrEST_stats = PrEST_stats_1.append(PrEST_stats_2)
... and I have a new problem. Splitting the two cases into two new DataFrames works great, and what I now want to do is to treat them slightly differently in the for loop above. I've checked the rows commented with # PROBLEM: if I add .median() there as well, I get the same results as before, i.e. the rest of the code works. It's only when I try to keep the data as it is that it doesn't! This is the error I get:
Traceback (most recent call last):
  File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 110, in <module>
    current_stats['R1'] = current_replicate['pepCN1']
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1863, in __setitem__
    self._set_item(key, value)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1938, in _set_item
    self._ensure_valid_index(value)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/frame.py", line 1915, in _ensure_valid_index
    raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
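The traceback shows why: current_replicate['pepCN1'] is a SeriesGroupBy object, not a Series, and a DataFrame column can only be set from something convertible to a Series. A minimal sketch with made-up stand-in data:

```python
import pandas as pd

# Stand-ins for current_replicate / current_stats (the 'id' values are made up).
df = pd.DataFrame({'id': ['a', 'a', 'b'], 'pepCN1': [4972.0, 17095.0, 5298.0]})
current_replicate = df.groupby('id')

current_stats = pd.DataFrame()
# current_stats['R1'] = current_replicate['pepCN1']   # SeriesGroupBy: raises
current_stats['R1'] = current_replicate['pepCN1'].median()  # a real Series
print(current_stats)
```

Any aggregation (.median(), .apply(...), etc.) turns the grouped column into an actual Series, which assigns cleanly.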
... where temp is an empty list, but then I get the following error:
  File "/Users/erikfas/Dropbox/Jobb/Data - QE/QEtest.py", line 119, in <module>
    temp2 = pd.concat(temp)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 926, in concat
    verify_integrity=verify_integrity)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/tools/merge.py", line 986, in __init__
    if not 0 <= axis <= sample.ndim:
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 295, in __getattr__
    return self._make_wrapper(attr)
  File "/Users/erikfas/anaconda/envs/py33/lib/python3.3/site-packages/pandas/core/groupby.py", line 310, in _make_wrapper
    raise AttributeError(msg)
AttributeError: Cannot access attribute 'ndim' of 'SeriesGroupBy' objects, try using the 'apply' method
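This second error has the same root cause: the elements that ended up in temp are still SeriesGroupBy objects, and pd.concat only accepts Series/DataFrames. A sketch of the working pattern (stand-in data; column name from the question):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a'], 'pepCN1': [4972.0, 17095.0]})
grouped = df.groupby('id')

temp = []
# Appending grouped['pepCN1'] itself (a SeriesGroupBy) makes pd.concat fail;
# aggregate first so every element of temp is a real Series:
temp.append(grouped['pepCN1'].median())
temp.append(grouped['pepCN1'].std(ddof=1))

temp2 = pd.concat(temp, axis=1)
temp2.columns = ['median', 'std']
print(temp2)
```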
Try this:
In [28]: df
Out[28]:
id gene p1 p2 p3
0 HPRR1 CD38 5298 10158 NaN
1 HPRR2 EGFR 79749 85793 117274
2 HPRR6 EPS8 68076 62305 66599
3 HPRR6 EPS8 NaN NaN 141828
4 HPRR9 PTK2B 4972 NaN NaN
5 HPRR9 PTK2B 17095 NaN NaN
[6 rows x 5 columns]
Groupby the id field (which I think is where you need valid medians). Figure out whether there are any invalid medians in that group (e.g. they come up NaN after combining the group). You want to remove groups that have only 1 valid value, right?
In [54]: df.groupby('id').median().count(1) == 1
Out[54]:
id
HPRR1 False
HPRR2 False
HPRR6 False
HPRR9 True
dtype: bool
In [30]: indexers = df.groupby('id').median().count(1) == 1
Take these out of the original data (and then rerun the above), or fill them in, or whatever:
In [67]: df.loc[~df.id.isin(indexers[indexers].index)]
Out[67]:
id gene p1 p2 p3
0 HPRR1 CD38 5298 10158 NaN
1 HPRR2 EGFR 79749 85793 117274
2 HPRR6 EPS8 68076 62305 66599
3 HPRR6 EPS8 NaN NaN 141828
[4 rows x 5 columns]
For your overall calculations, you can do it like this. It's preferable to appending to an initially empty DataFrame:
results = []
for r in range(2):
    # do the calcs from above to generate say df1 and df2
    results.append(df1)
    results.append(df2)
# concatenate the rows!
final_result = pd.concat(results)
Prob easiest to do: df.dropna(subset=['pepCN2', 'pepCN3'], how='all') to remove them from the main calculation a priori, then use what was dropped to actually do what you need.

First, can I make this happen only when the data looks like the last case? I mean, in the normal case I'd want to do what I did above; it's only when there is just one column of data. And second, how would I do this arbitrarily, i.e. when it is pepCN2 or pepCN3 that holds all the data instead of pepCN1 as above?

Thanks a lot for the answer, but HPRR1 is perfectly valid and shouldn't be removed =/ The only ID that is invalid in the normal data processing is HPRR9, since it is the only one that has data values in just one column but in multiple rows. Sorry for being unclear! Updated...

That's the beauty of doing it this way: you can easily change your mask to whatever criteria you need.

Very nice! A new problem has come up, though, as an edit to the original question.

Rather than appending to initially empty frames, append the calculation results to a list and then do a single pd.concat(list_of_frames) at the end (it will also be much faster).

Do you mean using a list instead? I'm using pandas so that I don't have to do that... The problem isn't when I append the two PrEST_stats frames, but when I try to do this: current_stats['R1'] = current_replicate['pepCN1']
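Following up on the comment about changing the mask: a sketch of a mask that flags only groups like HPRR9 (data in a single column but several rows) while leaving HPRR1 alone. The shortened column names p1/p2/p3 are stand-ins, matching the answer's example table:

```python
import pandas as pd
import numpy as np

# Shortened stand-in for the real table.
df = pd.DataFrame({
    'id': ['HPRR1', 'HPRR9', 'HPRR9'],
    'p1': [5298.0, 4972.0, 17095.0],
    'p2': [10158.0, np.nan, np.nan],
    'p3': [np.nan, np.nan, np.nan],
})

g = df.groupby('id')
counts = g[['p1', 'p2', 'p3']].count()
# Flag groups where exactly one column has data AND there is more than one
# row: HPRR1 (one row, two columns with data) stays valid, HPRR9 is caught.
mask = (counts.gt(0).sum(axis=1) == 1) & (g.size() > 1)
print(mask)
```

Because the mask works on any column's counts rather than a hard-coded pepCN1, it also handles the case where pepCN2 or pepCN3 is the only populated column.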