Python: multiplying and summing certain columns based on their names


I have a small sample dataset:

import pandas as pd
d = {
  'measure1_x': [10,12,20,30,21],
  'measure2_x':[11,12,10,3,3],
  'measure3_x':[10,0,12,1,1],
  'measure1_y': [1,2,2,3,1],
  'measure2_y':[1,1,1,3,3],
  'measure3_y':[1,0,2,1,1]
}
df = pd.DataFrame(d)
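# reindex_axis is not available in newer pandas releases; reindex(columns=...) sets the same column order
df = df.reindex(columns=[
    'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
])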

It looks like this:

      measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y
          10          11          10           1           1           1
          12          12           0           2           1           0
          20          10          12           2           1           2
          30           3           1           3           3           1
          21           3           1           1           3           1

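I created column names that are identical apart from the "_x" and "_y" suffixes, to help identify the pairs that should be multiplied. I want to multiply the columns that share the same name once "_x" and "_y" are ignored, and then sum the products to get a total. Keep in mind that my actual dataset is very large and its columns are not in a tidy order, so this naming scheme is how the correct pairs are identified for multiplication: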

total=measure1_x*measure1_y+measure2_x*measure2_y+measure3_x*measure3_y

Desired output:

measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y   total

 10          11          10           1           1           1           31
 12          12           0           2           1           0           36
 20          10          12           2           1           2           74
 30           3           1           3           3           1          100
 21           3           1           1           3           1           31

My attempt and thought process so far; I could not work out the syntax to continue:

# First identify the column names that contain '_x' or '_y', then check whether
# two names are the same after removing '_x' and '_y'. If a pair shares the
# same name, multiply the two columns; do that for all pairs and sum the
# results to get the total.

for colname in df.columns:
    if "_x" in colname.lower() or "_y" in colname.lower():
        if "_x" in colname.lower():
            colnamex = colname
        if "_y" in colname.lower():
            colnamey = colname

        # if colnamex[:-2] is the same for colnamex and colnamey, then multiply and sum
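One way the thought process above could be completed is sketched here. It assumes every pair differs only in a trailing `_x` or `_y`, and it skips any `_x` column that has no `_y` partner; the variable names `base`, `partner`, and `total` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Start from a zero Series aligned with the rows of df.
total = pd.Series(0, index=df.index)

for colname in df.columns:
    if colname.endswith('_x'):
        base = colname[:-2]        # e.g. 'measure1'
        partner = base + '_y'      # the column it should be multiplied with
        if partner in df.columns:  # skip '_x' columns without a '_y' partner
            total = total + df[colname] * df[partner]

# Attach the result only after the loop, so the new column is never iterated over.
df['total'] = total
print(df['total'].tolist())
```

Because the pairing is done by name lookup, the order of the columns in the frame does not matter here. For a very wide frame the vectorised approaches in the answers are likely to be faster.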

- Use `df.columns.str.split` to generate a new MultiIndex
- Use `prod` with the `axis` and `level` arguments
- Use `sum` with the `axis` argument
- Use `assign` to create the new column
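The steps above can be sketched as follows. This is not the answer's exact code: to my knowledge newer pandas releases no longer accept the `level` argument of `prod`, so the sketch groups the transposed frame by the first index level instead. The names `wide`, `products`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Step 1: split 'measure1_x' into ('measure1', 'x') to get a two-level column index.
wide = df.copy()
wide.columns = wide.columns.str.split('_', expand=True)

# Step 2: multiply the columns that share the same first level.
# Transposing and grouping rows by level 0 stands in for prod(axis=1, level=0).
products = wide.T.groupby(level=0).prod().T

# Steps 3 and 4: sum across each row and attach the result as a new column.
result = df.assign(total=products.sum(axis=1))
print(result['total'].tolist())
```

`assign` returns a new frame and leaves `df` untouched, which is convenient while checking the totals.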

Restrict the dataframe to just the columns that look like

'measure[i]_[j]'
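A minimal sketch of such a restriction, using `DataFrame.filter` with a regular expression. The exact pattern, and the extra `id` and `notes_x` columns, are assumptions made up for illustration; the real column names may need a different pattern.

```python
import pandas as pd

# Hypothetical frame with two columns that should not take part in the product.
df = pd.DataFrame({
    'id': ['a', 'b'],
    'measure1_x': [10, 12],
    'measure1_y': [1, 2],
    'notes_x': ['foo', 'bar'],
})

# Keep only names of the form measure<digits>_x or measure<digits>_y.
measures = df.filter(regex=r'^measure\d+_[xy]$')
print(list(measures.columns))
```

Anchoring the pattern with `^` and `$` keeps a column such as `notes_x` out, even though it ends in `_x`.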

Debugging: see whether this gets you the correct totals

d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)

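# Original form, for pandas versions that still accept `level` in prod:
#   d_.prod(axis=1, level=0).sum(1)
# Equivalent that also runs where `level` is no longer accepted:
d_.T.groupby(level=0).prod().T.sum(axis=1)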

0     31
1     36
2     74
3    100
4     31
dtype: int64

filter + np.einsum

I thought I'd try something a little different this time:

- Get the `_x` and `_y` columns separately
- Do a product-sum. This is easy to specify with `einsum` (and fast)
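To make the product-sum concrete, here is a small sketch of what the subscript string `'ij,ij->i'` computes, on two tiny arrays taken from the first two rows of the sample data. The array names are illustrative.

```python
import numpy as np

x = np.array([[10, 11, 10],
              [12, 12, 0]])
y = np.array([[1, 1, 1],
              [2, 1, 0]])

# 'ij,ij->i': multiply matching entries, then sum over j (the columns),
# leaving one value per row i.
row_totals = np.einsum('ij,ij->i', x, y)

# The same computation written out explicitly.
explicit = (x * y).sum(axis=1)

print(row_totals.tolist(), explicit.tolist())
```

Both expressions pair columns purely by position, which is why the `_x` and `_y` blocks need to be in matching order before they are passed in.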

A slightly more robust version, which filters out non-numeric columns and performs an assertion beforehand:

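import numpy as np

# Sort the columns by name and drop non-numeric (object) columns
df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x')
j = df.filter(regex='.*_y')

# Both blocks must have the same number of rows and columns
assert i.shape == j.shape

df['Total'] = np.einsum('ij,ij->i', i, j)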

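If the assertion fails, then the assumptions that 1) your columns are numeric and 2) the numbers of x and y columns are equal, as your question suggests, do not hold for your actual dataset.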
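The shape assertion checks only the counts, not the names. A hedged sketch of a stricter variant follows: it strips the suffixes, checks that the base names agree, and reorders one block to match the other by name. The sample frame, including the `label` column, is hypothetical and deliberately out of order.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: columns out of order, plus one non-numeric column.
df = pd.DataFrame({
    'measure2_y': [1, 1],
    'measure1_x': [10, 12],
    'label': ['a', 'b'],
    'measure2_x': [11, 12],
    'measure1_y': [1, 2],
})

numeric = df.select_dtypes(include='number')

# Strip the two-character suffix so both blocks are labelled by the shared base name.
x = numeric.filter(regex='_x$').rename(columns=lambda c: c[:-2])
y = numeric.filter(regex='_y$').rename(columns=lambda c: c[:-2])

# Every base name must appear on both sides; otherwise some pair is incomplete.
assert set(x.columns) == set(y.columns), "unpaired _x/_y columns"

# Reorder y to match x by name, so the column order in df no longer matters.
y = y[x.columns]
df['Total'] = np.einsum('ij,ij->i', x.to_numpy(), y.to_numpy())
print(df['Total'].tolist())
```

With this check in place, a missing partner column raises an error instead of silently pairing the wrong columns.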

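I tried this on my larger actual dataset and got: TypeError: set_axis() got multiple values for argument 'axis'

Did you accidentally use axis twice in the set_axis call?

The columns of my actual dataset are not in perfect order. Forgive me if I don't fully understand your code, but where do you determine the correct pairs to multiply based on their names? Or is that not needed?

No, I copied and pasted your exact code.

Please try my updated suggestion. Aren't your columns restricted to just those that look like measurei_j?

I was going to force a dot product in somewhere to be cool. Now I don't have to, because einsum has already been used (:

@piRSquared I was trying to force one in too, until I remembered how I got shown up like that the other day XD

einsum is always super cool, but doesn't this assume the columns are ordered?

@filippo Sure, filter makes assumptions about the nature of the column names. But so does piR's answer. By the way, I've added an optional df.sort_index(axis=1) step above, in case it's needed.

Uh, I completely missed the sort_index!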

import numpy as np

df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted

i = df.filter(like='_x') 
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # (i.values * j).sum(axis=1)

df
   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
