Python DataFrame calculations
I have a fairly complex DataFrame that looks like this:
df = pd.DataFrame({'0': {('Total Number of End Points', '0.01um', '0hr'): 12,
('Total Number of End Points', '0.1um', '0hr'): 8,
('Total Number of End Points', 'Control', '0hr'): 4,
('Total Number of End Points', '0.01um', '24hr'): 18,
('Total Number of End Points', '0.1um', '24hr'): 12,
('Total Number of End Points', 'Control', '24hr'): 6,
('Total Vessel Length', '0.01um', '0hr'): 12,
('Total Vessel Length', '0.1um', '0hr'): 8,
('Total Vessel Length', 'Control', '0hr'): 4,
('Total Vessel Length', '0.01um', '24hr'): 18,
('Total Vessel Length', '0.1um', '24hr'): 12,
('Total Vessel Length', 'Control', '24hr'): 6},
'1': {('Total Number of End Points', '0.01um', '0hr'): 12,
('Total Number of End Points', '0.1um', '0hr'): 8,
('Total Number of End Points', 'Control', '0hr'): 4,
('Total Number of End Points', '0.01um', '24hr'): 18,
('Total Number of End Points', '0.1um', '24hr'): 12,
('Total Number of End Points', 'Control', '24hr'): 6,
('Total Vessel Length', '0.01um', '0hr'): 12,
('Total Vessel Length', '0.1um', '0hr'): 8,
('Total Vessel Length', 'Control', '0hr'): 4,
('Total Vessel Length', '0.01um', '24hr'): 18,
('Total Vessel Length', '0.1um', '24hr'): 12,
('Total Vessel Length', 'Control', '24hr'): 6},
'2': {('Total Number of End Points', '0.01um', '0hr'): 12,
('Total Number of End Points', '0.1um', '0hr'): 8,
('Total Number of End Points', 'Control', '0hr'): 4,
('Total Number of End Points', '0.01um', '24hr'): 18,
('Total Number of End Points', '0.1um', '24hr'): 12,
('Total Number of End Points', 'Control', '24hr'): 6,
('Total Vessel Length', '0.01um', '0hr'): 12,
('Total Vessel Length', '0.1um', '0hr'): 8,
('Total Vessel Length', 'Control', '0hr'): 4,
('Total Vessel Length', '0.01um', '24hr'): 18,
('Total Vessel Length', '0.1um', '24hr'): 12,
('Total Vessel Length', 'Control', '24hr'): 6}})
print(df)
                                          0   1   2
Total Number of End Points 0.01um  0hr   12  12  12
                                   24hr  18  18  18
                           0.1um   0hr    8   8   8
                                   24hr  12  12  12
                           Control 0hr    4   4   4
                                   24hr   6   6   6
Total Vessel Length        0.01um  0hr   12  12  12
                                   24hr  18  18  18
                           0.1um   0hr    8   8   8
                                   24hr  12  12  12
                           Control 0hr    4   4   4
                                   24hr   6   6   6
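I am trying to divide every value by the mean, taken across the columns, of the corresponding Control row. I tried the following, but it did not work: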
df2 = df.divide(df.xs('Control', level=1).mean(axis=1), axis='index')
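I am very new to Python and pandas, so I tend to think about this in MS Excel terms.
In Excel, the formula for A1 ('Total Number of End Points', '0.01um', '0hr', '0') would be:
=A1/AVERAGE($A$5:$C$5)
For B1 ('Total Number of End Points', '0.01um', '0hr', '1') it would be:
=B1/AVERAGE($A$5:$C$5)
And for A2 ('Total Number of End Points', '0.01um', '24hr', '0'):
=A2/AVERAGE($A$6:$C$6)
The expected result for this example is: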
                                         0  1  2
Total Number of End Points 0.01um  0hr   3  3  3
                                   24hr  3  3  3
                           0.1um   0hr   2  2  2
                                   24hr  2  2  2
                           Control 0hr   1  1  1
                                   24hr  1  1  1
Total Vessel Length        0.01um  0hr   3  3  3
                                   24hr  3  3  3
                           0.1um   0hr   2  2  2
                                   24hr  2  2  2
                           Control 0hr   1  1  1
                                   24hr  1  1  1
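Note: the real data has many more index levels and columns.
The issue here is that pandas is organized so that calculations on columns are easy, whereas you need to divide rows by the mean of another row, which is not what pandas is designed for. However, you can use the transpose, .T, to switch rows and columns easily, which may make this easier to work with. In fact, the control mean then becomes a one-liner: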
>>> df.T[(u'Total Vessel Length', u'Control', u'0hr')].mean()
4.0
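That 4.0 is the mean of the 4 values in the original data (this output comes from an earlier revision of the question, whose columns were a and b):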
>>> df.T[(u'Total Vessel Length', u'Control', u'0hr')]
a 4
b 4
A for loop now looks like it would solve this.
Untested:
for primary in (u'Total Vessel Length', u'Total Number of End Points'):
    for um in (u'0.01um', u'0.1um'):
        for hours in (u'0hr', u'24hr'):
            df.T[(primary, um, hours)] = df.T[(primary, um, hours)] / df.T[(primary, u'Control', hours)].mean()
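Note that this does not divide the Control rows themselves, but it is easy to add 'Control' to the um loop.
Update: this does not work. Somehow it does not modify the DataFrame in place, and for now I don't know why.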
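A likely explanation, offered as a sketch rather than a verified diagnosis of the loop above: df.T evaluates to a new DataFrame, so an assignment into df.T[...] changes that temporary object and never reaches df. The toy frame and the names small, transposed and normalized are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real data.
small = pd.DataFrame({'0': [12, 4], '1': [12, 4]}, index=['treated', 'control'])

# .T produces a separate DataFrame; keep a reference to it so the
# assignment below is not lost.
transposed = small.T
transposed['treated'] = transposed['treated'] / transposed['control'].mean()

# The original frame is untouched by the assignment above.
print(small.loc['treated', '0'])

# Transpose back to get the normalized values in the original orientation.
normalized = transposed.T
print(normalized)
```

Holding the transposed frame in a variable, modifying it, and transposing back is one way to make the loop approach take effect.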
But you can construct a new DataFrame by calling pd.DataFrame on a dict comprehension.
This seems to work:
import pandas as pd
df = pd.DataFrame({'0': {('Total Number of End Points', '0.01um', '0hr'): 12,
('Total Number of End Points', '0.1um', '0hr'): 8,
('Total Number of End Points', 'Control', '0hr'): 4,
('Total Number of End Points', '0.01um', '24hr'): 18,
('Total Number of End Points', '0.1um', '24hr'): 12,
('Total Number of End Points', 'Control', '24hr'): 6,
('Total Vessel Length', '0.01um', '0hr'): 12,
('Total Vessel Length', '0.1um', '0hr'): 8,
('Total Vessel Length', 'Control', '0hr'): 4,
('Total Vessel Length', '0.01um', '24hr'): 18,
('Total Vessel Length', '0.1um', '24hr'): 12,
('Total Vessel Length', 'Control', '24hr'): 6},
'1': {('Total Number of End Points', '0.01um', '0hr'): 12,
('Total Number of End Points', '0.1um', '0hr'): 8,
('Total Number of End Points', 'Control', '0hr'): 4,
('Total Number of End Points', '0.01um', '24hr'): 18,
('Total Number of End Points', '0.1um', '24hr'): 12,
('Total Number of End Points', 'Control', '24hr'): 6,
('Total Vessel Length', '0.01um', '0hr'): 12,
('Total Vessel Length', '0.1um', '0hr'): 8,
('Total Vessel Length', 'Control', '0hr'): 4,
('Total Vessel Length', '0.01um', '24hr'): 18,
('Total Vessel Length', '0.1um', '24hr'): 12,
('Total Vessel Length', 'Control', '24hr'): 6},
'2': {('Total Number of End Points', '0.01um', '0hr'): 12,
('Total Number of End Points', '0.1um', '0hr'): 8,
('Total Number of End Points', 'Control', '0hr'): 4,
('Total Number of End Points', '0.01um', '24hr'): 18,
('Total Number of End Points', '0.1um', '24hr'): 12,
('Total Number of End Points', 'Control', '24hr'): 6,
('Total Vessel Length', '0.01um', '0hr'): 12,
('Total Vessel Length', '0.1um', '0hr'): 8,
('Total Vessel Length', 'Control', '0hr'): 4,
('Total Vessel Length', '0.01um', '24hr'): 18,
('Total Vessel Length', '0.1um', '24hr'): 12,
('Total Vessel Length', 'Control', '24hr'): 6}})
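print(df)
# Build a new frame: each key is a (field, type, time) row label and each value
# is that row divided by the mean of the matching Control row.
df2 = pd.DataFrame({
    (primary, um, hours): df.T[(primary, um, hours)]
                          / df.T[(primary, u'Control', hours)].mean()
    for primary in (u'Total Vessel Length', u'Total Number of End Points')
    for um in (u'0.01um', u'0.1um')
    for hours in (u'0hr', u'24hr')})
print(df2.T)
Output: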
paul@home:~/SO$ python ./r.py
0 1 2
(Total Number of End Points, 0.01um, 0hr) 12 12 12
(Total Number of End Points, 0.01um, 24hr) 18 18 18
(Total Number of End Points, 0.1um, 0hr) 8 8 8
(Total Number of End Points, 0.1um, 24hr) 12 12 12
(Total Number of End Points, Control, 0hr) 4 4 4
(Total Number of End Points, Control, 24hr) 6 6 6
(Total Vessel Length, 0.01um, 0hr) 12 12 12
(Total Vessel Length, 0.01um, 24hr) 18 18 18
(Total Vessel Length, 0.1um, 0hr) 8 8 8
(Total Vessel Length, 0.1um, 24hr) 12 12 12
(Total Vessel Length, Control, 0hr) 4 4 4
(Total Vessel Length, Control, 24hr) 6 6 6
[12 rows x 3 columns]
0 1 2
(Total Number of End Points, 0.01um, 0hr) 3 3 3
(Total Number of End Points, 0.01um, 24hr) 3 3 3
(Total Number of End Points, 0.1um, 0hr) 2 2 2
(Total Number of End Points, 0.1um, 24hr) 2 2 2
(Total Vessel Length, 0.01um, 0hr) 3 3 3
(Total Vessel Length, 0.01um, 24hr) 3 3 3
(Total Vessel Length, 0.1um, 0hr) 2 2 2
(Total Vessel Length, 0.1um, 24hr) 2 2 2
[8 rows x 3 columns]
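The comprehension above hard-codes the level values and leaves out the Control rows. A minimal sketch of the same idea that walks df.index instead, so every row (Control included) is covered; the small frame and the name normalized are assumptions for illustration, not part of the original script.

```python
import pandas as pd

# Hypothetical reduced data set with the same three index levels.
index = pd.MultiIndex.from_product(
    [['Total Vessel Length'], ['0.01um', '0.1um', 'Control'], ['0hr', '24hr']])
df = pd.DataFrame({'0': [12, 18, 8, 12, 4, 6],
                   '1': [12, 18, 8, 12, 4, 6]}, index=index)

# For each row label, divide that row by the mean of the Control row
# that shares its field and time.
normalized = pd.DataFrame({
    (field, kind, time): df.loc[(field, kind, time)]
                         / df.loc[(field, 'Control', time)].mean()
    for field, kind, time in df.index
}).T

print(normalized)
```

This still loops in Python, so it is a convenience for small frames rather than a fast path.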
It helps to put the Control values in their own columns. You can do that with unstack:
df.index.names = ['field', 'type', 'time']
df2 = df.unstack(['type']).swaplevel(0, 1, axis=1)
# type                             0.01um  0.1um  Control  0.01um  0.1um  Control  \
#                                       0      0        0       1      1        1
# field                      time
# Total Number of End Points 0hr       12      8        4      12      8        4
#                            24hr      18     12        6      18     12        6
# Total Vessel Length        0hr       12      8        4      12      8        4
#                            24hr      18     12        6      18     12        6
#
# type                             0.01um  0.1um  Control
#                                       2      2        2
# field                      time
# Total Number of End Points 0hr       12      8        4
#                            24hr      18     12        6
# Total Vessel Length        0hr       12      8        4
#                            24hr      18     12        6
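To see what the reshaping step does in isolation, here is a small sketch on a made-up two-treatment frame (the data and the names wide and control are illustrative assumptions). unstack moves the 'type' level into the columns, and swaplevel puts it on top so that a whole treatment can be selected by name.

```python
import pandas as pd

# Hypothetical reduced frame with named index levels.
index = pd.MultiIndex.from_product(
    [['Total Vessel Length'], ['0.01um', 'Control'], ['0hr', '24hr']],
    names=['field', 'type', 'time'])
df = pd.DataFrame({'0': [12, 18, 4, 6], '1': [12, 18, 4, 6]}, index=index)

# Move 'type' from the rows into the columns: columns become (column, type).
wide = df.unstack('type')

# Swap the two column levels so they read (type, column).
wide = wide.swaplevel(0, 1, axis=1)

# Selecting the top-level key now returns every Control column at once.
control = wide['Control']
print(control)
```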
Now compute the mean of the Control values for each row:
ave = df2['Control'].mean(axis=1)
# field                       time
# Total Number of End Points  0hr     4
#                             24hr    6
# Total Vessel Length         0hr     4
#                             24hr    6
# dtype: float64
As you might expect, you can use df2.divide to compute the desired result. Be sure to pass axis=0 to tell pandas to match values on the row index (of df2 and ave):
result = df2.divide(ave, axis=0)
# type                             0.01um  0.1um  Control  0.01um  0.1um  Control  \
#                                       0      0        0       1      1        1
# field                      time
# Total Number of End Points 0hr        3      2        1       3      2        1
#                            24hr       3      2        1       3      2        1
# Total Vessel Length        0hr        3      2        1       3      2        1
#                            24hr       3      2        1       3      2        1
#
# type                             0.01um  0.1um  Control
#                                       2      2        2
# field                      time
# Total Number of End Points 0hr        3      2        1
#                            24hr       3      2        1
# Total Vessel Length        0hr        3      2        1
#                            24hr       3      2        1
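The effect of axis=0 is easiest to see on a tiny frame. In this sketch (frame, row_means, by_rows and by_columns are made-up names), the same Series is divided in once with axis=0 and once with the default axis; only the first aligns the Series with the row labels.

```python
import pandas as pd

frame = pd.DataFrame({'x': [2.0, 4.0], 'y': [6.0, 8.0]}, index=['r1', 'r2'])
row_means = pd.Series([2.0, 4.0], index=['r1', 'r2'])

# axis=0: the Series index is matched against the row labels of the frame.
by_rows = frame.divide(row_means, axis=0)

# Default axis: the Series index is matched against the column labels.
# None of 'r1', 'r2' is a column name, so nothing lines up and every
# entry of the result is NaN.
by_columns = frame.divide(row_means)

print(by_rows)
print(by_columns)
```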
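Essentially, these are the values you are after. However, if you want to rearrange the DataFrame so that it looks exactly like what you posted, then: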
result = result.stack(['type'])
result = result.reorder_levels(['field','type','time'], axis=0)
result = result.reindex(df.index)
yields
                                         0  1  2
field                      type    time
Total Number of End Points 0.01um  0hr   3  3  3
                                   24hr  3  3  3
                           0.1um   0hr   2  2  2
                                   24hr  2  2  2
                           Control 0hr   1  1  1
                                   24hr  1  1  1
Total Vessel Length        0.01um  0hr   3  3  3
                                   24hr  3  3  3
                           0.1um   0hr   2  2  2
                                   24hr  2  2  2
                           Control 0hr   1  1  1
                                   24hr  1  1  1
Putting it all together:
df.index.names = ['field', 'type', 'time']
df2 = df.unstack(['type']).swaplevel(0, 1, axis=1)
ave = df2['Control'].mean(axis=1)
result = df2.divide(ave, axis=0)
result = result.stack(['type'])
result = result.reorder_levels(['field','type','time'], axis=0)
result = result.reindex(df.index)
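As a possible alternative that skips the unstack/stack round trip, the Control means can be looked up once per row and divided in directly. This is a sketch on a reduced, made-up frame, not the method used above; ctrl_mean, keys and denominators are illustrative names.

```python
import pandas as pd

# Hypothetical reduced frame with named index levels.
index = pd.MultiIndex.from_product(
    [['Total Vessel Length'], ['0.01um', '0.1um', 'Control'], ['0hr', '24hr']],
    names=['field', 'type', 'time'])
df = pd.DataFrame({'0': [12, 18, 8, 12, 4, 6],
                   '1': [12, 18, 8, 12, 4, 6]}, index=index)

# Mean of each Control row, indexed by (field, time).
ctrl_mean = df.xs('Control', level='type').mean(axis=1)

# The (field, time) pair of every row in df, in row order.
keys = df.index.droplevel('type')

# Look up the matching Control mean for each row, then divide row-wise.
denominators = ctrl_mean.reindex(keys).to_numpy()
result = df.div(denominators, axis=0)

print(result)
```

Because the result keeps the index of df, no reordering step is needed afterwards.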
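Can you provide an example of the expected output?
When I put the data from the top of your question into a DataFrame, it differs from what you get with print(df). The df = ... and the print(df) are two different DataFrames; your print(df) has nothing to do with the code above it. Your input columns are ['a', 'b'], but the printed columns are [0, 1, 2]. Can you make everything consistent? Thanks.
@MarkGraph Wow, you're right. I'll fix it.
In pandas, data is organized internally by column, so extracting or computing on columns is easiest. Can you reorganize the data so that all the Control values are in their own columns?
I get the same result as in (reference missing). Is inplace=True needed somewhere?
Same here. It looks very familiar. I'll look around; maybe it is related. Still looking.
Interesting. I hadn't noticed that an index can be made of tuples and that it has all these associated methods.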