Python: match 5 columns across 3 DataFrames and create a calculated column

python python-3.x pandas dataframe

I have 3 DataFrames, as shown below:

df1:

df2:

df3:

Expected output:

a  l  m  n  o  p  2020    2021     new_2021
a1 b1 c1 d1 e1 f1 534.385 540.210  540.210*(340.210/440.210)
a7 b7 c7 d7 e7 f7 544.385 550.210  numpy.nan
a4 b2 c4 d4 e4 f4 554.385 560.210  560.210*((460.210+461.210)/560.210)

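Explanation:
I want to match the first 5 string columns across all 3 DataFrames and create a new column from a simple calculation on the year columns. df3 is my reference DataFrame, and I want to adjust the values in its year column by the rate of change between df1 and df2.
For example: for rows where all 5 columns match, I want to compute df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021']).
If several rows share the same values in the first 5 columns, I want to use the sum of the year column in the calculation, as in the third row of the expected output.
If no match for all 5 columns of a df3 row is found in df1, in df2, or in both, I want that row to stay empty (NaN), as in the second row of the expected output.

How can I do this efficiently? My DataFrames are very large.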

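Because the first 5 columns may contain duplicated values, you can aggregate with sum, then use rename_axis to give the index levels of df1 and df2 the same names as the first 5 columns of df3. With identical index names in all three DataFrames, the division and multiplication align on the index: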

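import pandas as pd

# aggregate duplicated keys; numeric_only=True keeps string columns such as p
# out of the sum (recent pandas versions no longer drop them automatically)
df1 = df1.groupby(df1.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df2 = df2.groupby(df2.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df3 = df3.groupby(df3.columns[:5].tolist()).sum(numeric_only=True)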

df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021'])
print (df3)
                   2020    2021    new_2021
a  l  m  n  o                              
a1 b1 c1 d1 e1  534.385  540.21  836.214303
a4 b2 c4 d4 e4  554.385  560.21  219.002457
a7 b7 c7 d7 e7  544.385  550.21         NaN
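Since the contents of df1, df2 and df3 are not shown in the question, here is a minimal, self-contained sketch of the same aggregate-then-align idea. All values, the key labels (a1, l1, ...) and the frame helper are made up for illustration, and df3 is assumed to have unique keys:

```python
import pandas as pd

keys = ["a", "l", "m", "n", "o"]

def frame(suffixes, values):
    # Hypothetical helper: builds key columns such as a1, l1, m1, ... plus a "2021" column.
    data = {k: [k + s for s in suffixes] for k in keys}
    data["2021"] = values
    return pd.DataFrame(data)

df1 = frame(["1", "4"], [100.0, 30.0])
df2 = frame(["1", "4", "4"], [50.0, 10.0, 20.0])  # key 4 appears twice, so it is summed
df3 = frame(["1", "7", "4"], [10.0, 20.0, 30.0])  # key 7 has no match in df1/df2

# One row per key combination, indexed by the 5 key columns.
s1 = df1.groupby(keys)["2021"].sum()
s2 = df2.groupby(keys)["2021"].sum()
ref = df3.set_index(keys)  # assumes the keys in df3 are unique

# Index alignment leaves NaN wherever a key is missing from df1 or df2.
ref["new_2021"] = ref["2021"] * (s1 / s2)
print(ref)
```

The unmatched key ends up as NaN without any explicit check, because arithmetic between Series aligns on the shared index labels.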

EDIT: Working with a duplicated MultiIndex in df3 is possible, but it needs a few more steps:

print (df3)
    a   l   m   n   o   p     2020    2021
0  a1  b1  c1  d1  e1  f1  534.385  540.21
1  a7  b7  c7  d7  e7  f7  544.385  550.21
2  a4  b2  c4  d4  e4  f4  554.385  560.21
3  a1  b1  c1  d1  e1  f1  534.385  200.00
4  a7  b7  c7  d7  e7  f7  544.385  800.00
5  a4  b2  c4  d4  e4  f4  554.385  500.00

# aggregate duplicated keys in df1 and df2 (numeric columns only)
df1 = df1.groupby(df1.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df2 = df2.groupby(df2.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())

# move the first 5 columns of df3 into the index and sort it
df3 = df3.set_index(df3.columns[:5].tolist()).sort_index()

# build a unique MultiIndex from df3 and reindex df1, df2 to it
mux = pd.MultiIndex.from_frame(df3.index.to_frame().drop_duplicates())
df1 = df1.reindex(mux)
df2 = df2.reindex(mux)
print (df2)
                   2020    2021
a  l  m  n  o                  
a1 b1 c1 d1 e1  434.385  440.21
a4 b2 c4 d4 e4  909.770  921.42
a7 b7 c7 d7 e7      NaN     NaN

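Please post your expected output.

@MayankPorwal It has been posted.

@NaveenKumar - It is duplicated. How does it work if you remove df3 = df3.groupby(df3.columns[:5].tolist()).sum() from my solution?

If I remove the groupby (which introduces the MultiIndex), I cannot make sure that the multiplication and division happen only where there is a match, correct? Please correct me if I am wrong.

@NaveenKumar - You are right. Please try changing df3 = df3.groupby(df3.columns[:5].tolist()).sum() to df3 = df3.set_index(df3.columns[:5].tolist())

@NaveenKumar - Unfortunately that failed for me with ValueError: cannot handle a non-unique multi-index!, so I added a solution tested in pandas 1.2.3.

Thanks a lot.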
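# final step of the edited solution: df1 and df2 now share df3's unique keys,
# so each duplicated df3 row is multiplied by the ratio for its key
df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021'])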
print (df3)
                 p     2020    2021    new_2021
a  l  m  n  o                                  
a1 b1 c1 d1 e1  f1  534.385  540.21  836.214303
            e1  f1  534.385  200.00  309.588605
a4 b2 c4 d4 e4  f4  554.385  560.21  219.002457
            e4  f4  554.385  500.00  195.464609
a7 b7 c7 d7 e7  f7  544.385  550.21         NaN
            e7  f7  544.385  800.00         NaN
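As an alternative for duplicated keys in df3, a left merge avoids the non-unique MultiIndex altogether. This is a different technique from the answer above, sketched under assumptions: the data, the key labels, the frame helper, and the temporary column names num and den are all hypothetical.

```python
import pandas as pd

keys = ["a", "l", "m", "n", "o"]

def frame(suffixes, values):
    # Hypothetical helper: builds key columns such as a1, l1, m1, ... plus a "2021" column.
    data = {k: [k + s for s in suffixes] for k in keys}
    data["2021"] = values
    return pd.DataFrame(data)

df1 = frame(["1", "4"], [100.0, 30.0])
df2 = frame(["1", "4", "4"], [50.0, 10.0, 20.0])
df3 = frame(["1", "7", "4", "1"], [10.0, 20.0, 30.0, 40.0])  # key 1 is duplicated

# Aggregate df1 and df2 to one row per key; keep the keys as ordinary columns.
num = df1.groupby(keys, as_index=False)["2021"].sum().rename(columns={"2021": "num"})
den = df2.groupby(keys, as_index=False)["2021"].sum().rename(columns={"2021": "den"})

# Left merges keep every df3 row (in order); unmatched keys get NaN in num/den.
out = df3.merge(num, on=keys, how="left").merge(den, on=keys, how="left")
out["new_2021"] = out["2021"] * out["num"] / out["den"]
out = out.drop(columns=["num", "den"])
print(out)
```

Because df1 and df2 are reduced to one row per key before merging, the merges cannot multiply the rows of df3, which matters when the frames are large.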