Python: Is it possible to implement a DataFrame "apply" with a recursive lambda function?
Tags: python, pandas, dataframe, recursion, apply

I have a dataframe that represents a recursive parent-child relationship. The data in this case is called "factor families".

Each factor family contains a number of factors. The factors are weighted, and the weights within each family sum to 100%. A factor can itself be a factor family, and there is no limit to the depth of the recursion.

I have represented this in pandas with the following dataframe:
```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"]
})
df.set_index("code", inplace=True)
df
```
Output:
|code|weight|parent_code|
|----|------|-----------|
|a |0.1 | |
|b |0.4 | |
|c |0.5 | |
|d |0.2 |a |
|e |0.3 |a |
|f |0.5 |a |
|g |0.1 |b |
|h |0.2 |b |
|i |0.7 |b |
|j |0.6 |h |
|k |0.4 |h |
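I then added a computed column, which is a factor's weight multiplied by the weight of its parent, applied recursively up to the top-level family. I call this the terminal weight.

As a result, the terminal weights of the terminal nodes (in this case c, d, e, f, g, i, j, k) sum to 100%.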
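```python
def parent_weight(code, family_factors):
    # Multiply this factor's weight by the terminal weight of its parent.
    if code in family_factors.index:
        return family_factors["weight"][code] * parent_weight(family_factors["parent_code"][code], family_factors)
    else:
        # A top-level factor has parent_code "", which is not in the index.
        return 1

df["terminal_weight"] = df.apply(lambda x: parent_weight(x.name, df), axis=1)
df
```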
Output:
|code|weight|parent_code|terminal_weight|
|----|------|-----------|---------------|
|a |0.1 | |0.100 |
|b |0.4 | |0.400 |
|c |0.5 | |0.500 |
|d |0.2 |a |0.020 |
|e |0.3 |a |0.030 |
|f |0.5 |a |0.050 |
|g |0.1 |b |0.040 |
|h |0.2 |b |0.080 |
|i |0.7 |b |0.280 |
|j |0.6 |h |0.048 |
|k |0.4 |h |0.032 |
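The claim that the terminal weights of the terminal nodes sum to 100% can be checked with a short sketch. It rebuilds the dataframe from the question and treats a terminal node as any code that never appears in `parent_code`; the names `is_leaf` and `leaf_total` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"],
}).set_index("code")


def parent_weight(code, family_factors):
    # Same recursion as in the question.
    if code in family_factors.index:
        return family_factors.loc[code, "weight"] * parent_weight(
            family_factors.loc[code, "parent_code"], family_factors
        )
    return 1


df["terminal_weight"] = [parent_weight(code, df) for code in df.index]

# Assumption: a terminal node is a code that is nobody's parent_code.
is_leaf = ~df.index.isin(df["parent_code"])
leaf_codes = sorted(df.index[is_leaf])
leaf_total = df.loc[is_leaf, "terminal_weight"].sum()

print(leaf_codes)
print(leaf_total)
```

For this data the total should come out as 1.0, up to floating-point rounding.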
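So my question is: is there a smarter way to do this, so that I don't have to define the `parent_weight` function? Can I somehow put it inside the lambda function that is passed to `DataFrame.apply()`?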
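Thanks in advance.

I would do it like this: loop over subsets of the dataframe, and use temporary columns to store the chained weight and the parent currently being tested. Note that I replaced the blank strings in df with np.nan values.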
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": [np.nan, np.nan, np.nan, "a", "a", "a", "b", "b", "b", "h", "h"]
})

# 'temp' holds the ancestor currently being processed for each row.
df['temp'] = df['parent_code']
df['terminal_weight'] = df['weight']

while True:
    # Distinct ancestors that still have to be applied.
    parents = df[df.temp.notnull()][['temp']].drop_duplicates(keep='first').copy()
    if len(parents) == 0:
        break
    # Look up each ancestor's own weight and its parent. Using the ancestor's
    # running terminal_weight here would count some ancestors twice once the
    # hierarchy is more than three levels deep.
    parents = df[['code', 'weight', 'parent_code']].merge(
        parents.rename({"temp": "code"}, axis=1),
        on="code",
        how="inner"
    )
    parents.rename(
        {'weight': 'weight_parent', 'code': 'parent_code_temp', 'parent_code': 'temp'},
        axis=1,
        inplace=True
    )
    # Attach the ancestor's weight and the next ancestor up to every row.
    df = df.rename({'temp': 'parent_code_temp'}, axis=1).merge(
        parents,
        on='parent_code_temp',
        how='left'
    )
    df.drop('parent_code_temp', axis=1, inplace=True)
    # Rows with no ancestor left are multiplied by 1.
    df["weight_parent"] = df["weight_parent"].fillna(1)
    df['terminal_weight'] = df['terminal_weight'] * df["weight_parent"]
    df.drop(['weight_parent'], axis=1, inplace=True)

df.drop('temp', axis=1, inplace=True)
print(df)
```
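The same level-by-level idea can probably be written more compactly with `Series.map` instead of merges. The following is only a sketch under the assumptions of this answer (missing parents are `np.nan`, codes are unique, and the hierarchy contains no cycles, otherwise the loop would never end); the names `weight`, `parent`, `terminal` and `ancestor` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": [np.nan, np.nan, np.nan, "a", "a", "a", "b", "b", "b", "h", "h"],
})

# Lookup tables keyed by code.
weight = df.set_index("code")["weight"]
parent = df.set_index("code")["parent_code"]

terminal = df["weight"].copy()
ancestor = df["parent_code"]  # current ancestor of every row

while ancestor.notna().any():
    # Multiply by the ancestor's own weight; rows without an ancestor get 1.
    terminal = terminal * ancestor.map(weight).fillna(1)
    # Move every row one level up the tree.
    ancestor = ancestor.map(parent)

df["terminal_weight"] = terminal
print(df)
```

The loop runs once per level of the hierarchy rather than once per row, so no recursive helper function is needed.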
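As for the literal question: an anonymous lambda written inline in `apply` has no name to call itself by, but a lambda that is bound to a name can recurse through that name. The sketch below uses the blank-string version of the dataframe from the question; the name `terminal_weight_of` is hypothetical. It is not really shorter than the `def` version, and PEP 8 recommends a `def` over assigning a lambda to a name, so treat it as an illustration rather than a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "code": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"],
    "weight": [0.1, 0.4, 0.5, 0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.6, 0.4],
    "parent_code": ["", "", "", "a", "a", "a", "b", "b", "b", "h", "h"],
}).set_index("code")

# The lambda looks up its own name at call time, which is what lets it recurse.
terminal_weight_of = lambda code: (
    df.at[code, "weight"] * terminal_weight_of(df.at[code, "parent_code"])
    if code in df.index
    else 1  # parent_code "" is not in the index, so the recursion stops here
)

df["terminal_weight"] = df.apply(lambda row: terminal_weight_of(row.name), axis=1)
print(df)
```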