Python: subtract attribute values in one DataFrame from another DataFrame
Tags: python, python-3.x, pandas
This problem involves 3 separate DataFrames.
df1 represents the "Total" for products 1, 2, 3 and contains "Value1" and "Value2".
df2 represents "Customer1" for products 1, 2, 3 and contains "Value1" and "Value2".
df3 represents "Customer2" for products 1, 2, 3 and contains "Value1" and "Value2".
df2 and df3 are essentially subsets of df1.
I would like to create another DataFrame, df4, by subtracting df2 and df3 from df1. I want df4 to have "Remaining Customers" in the "Market" column.
This is what I have done so far:
import pandas as pd
d1 = {'Market': ['Total', 'Total','Total'], 'Product Code': [1, 2, 3],
'Value1':[10, 20, 30], 'Value2':[5, 15, 25]}
df1 = pd.DataFrame(data=d1)
df1
d2 = {'Market': ['Customer1', 'Customer1','Customer1'], 'Product Code': [1,
2, 3], 'Value1':[3, 14, 10], 'Value2':[2, 4, 6]}
df2 = pd.DataFrame(data=d2)
df2
d3 = {'Market': ['Customer2', 'Customer2','Customer2'], 'Product Code': [1,
2, 3], 'Value1':[3, 3, 4], 'Value2':[2, 6, 10]}
df3 = pd.DataFrame(data=d3)
df3
This produces the following output:
  Market  Product Code  Value1  Value2
0  Total             1      10       5
1  Total             2      20      15
2  Total             3      30      25
      Market  Product Code  Value1  Value2
0  Customer1             1       3       2
1  Customer1             2      14       4
2  Customer1             3      10       6
      Market  Product Code  Value1  Value2
0  Customer2             1       3       2
1  Customer2             2       3       6
2  Customer2             3       4      10
To create df4, I tried the following code and got the error "TypeError: unsupported operand type(s) for -: 'str' and 'str'". Can anyone help?
df4 = df1-(df2+df3)
print(df4)
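The error comes from the "Market" column, which holds strings. The following is a minimal sketch, using only df1 and df2 from the question, of where the arithmetic breaks: "+" is defined for strings (concatenation), so the addition goes through, while "-" is not. The variable names added, subtraction_failed, numeric_cols and diff are introduced here purely for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Market': ['Total'] * 3, 'Product Code': [1, 2, 3],
                    'Value1': [10, 20, 30], 'Value2': [5, 15, 25]})
df2 = pd.DataFrame({'Market': ['Customer1'] * 3, 'Product Code': [1, 2, 3],
                    'Value1': [3, 14, 10], 'Value2': [2, 4, 6]})

# "+" concatenates the Market strings, so the addition itself does not fail.
added = df1 + df2
print(added['Market'].tolist())

# "-" has no meaning for strings, which is where the TypeError comes from.
try:
    df1 - df2
    subtraction_failed = False
except TypeError as exc:
    subtraction_failed = True
    print('TypeError:', exc)

# Restricting the arithmetic to the numeric columns avoids the problem.
numeric_cols = ['Value1', 'Value2']
diff = df1[numeric_cols] - df2[numeric_cols]
print(diff)
```

Note that the addition also sums the "Product Code" column, which is another reason to limit the calculation to the value columns.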
Here is one way:
cols = ['Value1', 'Value2']
df4 = df1[cols].subtract(df2[cols].add(df3[cols]))\
.assign(**{'Market': 'RemainingCustomers', 'Product Code': [1, 2, 3]})\
.sort_index(axis=1)
#                Market  Product Code  Value1  Value2
# 0  RemainingCustomers             1       4       1
# 1  RemainingCustomers             2       3       5
# 2  RemainingCustomers             3      16       9
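Explanation
Perform the calculation on the specified columns only: df1[cols].subtract(df2[cols].add(df3[cols]))
Add the extra columns needed for the result DataFrame: assign(**{'Market': 'RemainingCustomers', 'Product Code': [1, 2, 3]})
Reorder the columns for the desired output: sort_index(axis=1)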
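The three steps above can be put together as a self-contained sketch. One deliberate change from the answer, and an assumption on my part: the product codes are taken from df1['Product Code'] instead of being hard-coded as [1, 2, 3]. This still relies on all three frames listing the products in the same row order.

```python
import pandas as pd

df1 = pd.DataFrame({'Market': ['Total'] * 3, 'Product Code': [1, 2, 3],
                    'Value1': [10, 20, 30], 'Value2': [5, 15, 25]})
df2 = pd.DataFrame({'Market': ['Customer1'] * 3, 'Product Code': [1, 2, 3],
                    'Value1': [3, 14, 10], 'Value2': [2, 4, 6]})
df3 = pd.DataFrame({'Market': ['Customer2'] * 3, 'Product Code': [1, 2, 3],
                    'Value1': [3, 3, 4], 'Value2': [2, 6, 10]})

cols = ['Value1', 'Value2']
df4 = (
    df1[cols]
    # Step 1: arithmetic on the numeric columns only
    .subtract(df2[cols].add(df3[cols]))
    # Step 2: add the label column and reuse the product codes from df1
    .assign(**{'Market': 'RemainingCustomers',
               'Product Code': df1['Product Code']})
    # Step 3: alphabetical column order happens to match the desired layout
    .sort_index(axis=1)
)
print(df4)
```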
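Drop Market, set Product Code as the index, and let pandas perform index-aligned arithmetic on Product Code. Afterwards, just reset the index and insert Market into the result.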
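# Drop the string column and index each frame by Product Code
# (columns= replaces the positional axis argument, which pandas 2.0 removed)
df1, df2, df3 = [
    df.drop(columns='Market').set_index('Product Code') for df in [df1, df2, df3]
]
# Arithmetic now aligns rows on Product Code rather than on row position
df4 = (df1 - (df2 + df3)).reset_index()
df4.insert(0, 'Market', 'RemainingCustomers')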
               Market  Product Code  Value1  Value2
0  RemainingCustomers             1       4       1
1  RemainingCustomers             2       3       5
2  RemainingCustomers             3      16       9
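Index alignment matters most when the frames do not list the products in the same order. The sketch below uses a hypothetical customer frame with shuffled rows (not part of the question's data) to compare positional subtraction with subtraction aligned on Product Code.

```python
import pandas as pd

total = pd.DataFrame({'Product Code': [1, 2, 3], 'Value1': [10, 20, 30]})
# Hypothetical customer frame whose rows are in a different order
customer = pd.DataFrame({'Product Code': [3, 1, 2], 'Value1': [10, 3, 14]})

# With the default RangeIndex, rows are paired by position,
# so product 1 is matched with product 3, and so on.
by_position = total['Value1'] - customer['Value1']
print(by_position.tolist())

# With Product Code as the index, rows are paired by label instead.
by_label = (
    total.set_index('Product Code') - customer.set_index('Product Code')
).reset_index()
print(by_label)
```

The positional result mixes up products, while the label-aligned result subtracts each product's customer value from that product's total.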
This is not exactly what the OP asked for, but in my opinion it may be a better way to manage the data:
df = pd.concat([df1, df2, df3]).set_index(['Product Code', 'Market'])
formula = 'RemainingCustomers = Total - Customer1 - Customer2'
df = df.unstack().stack(0).eval(formula).unstack()
df
Market       Customer1           Customer2           Total               RemainingCustomers
                Value1    Value2    Value1    Value2    Value1    Value2    Value1    Value2
Product Code
1                    3         2         3         2        10         5         4         1
2                   14         4         3         6        20        15         3         5
3                   10         6         4        10        30        25        16         9
And, if we insist on the requested output:
df.stack(0).reset_index().query(
'Market == "RemainingCustomers"').reindex(columns=df1.columns)
                Market  Product Code  Value1  Value2
2   RemainingCustomers             1       4       1
6   RemainingCustomers             2       3       5
10  RemainingCustomers             3      16       9
Or maybe we can use select_dtypes:
# Subtract only the non-object (numeric) columns, then restore the rest from df1
(df1.select_dtypes(exclude='object')
 - df2.select_dtypes(exclude='object')
 - df3.select_dtypes(exclude='object')).\
    drop(columns='Product Code').\
    combine_first(df1).\
    assign(Market='remaining customers')
Out[133]:
                Market  Product Code  Value1  Value2
0  remaining customers           1.0       4       1
1  remaining customers           2.0       3       5
2  remaining customers           3.0      16       9
Works perfectly. Thanks!
This does work, but you should split the answer across multiple lines :)
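# Alternative to the query() selection: take the RemainingCustomers
# cross-section from the reshaped df built with concat/unstack
df.stack(0).xs(
    'RemainingCustomers', level=1, drop_level=False
).reset_index().reindex(columns=df1.columns)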
               Market  Product Code  Value1  Value2
0  RemainingCustomers             1       4       1
1  RemainingCustomers             2       3       5
2  RemainingCustomers             3      16       9