Python: create new columns based on a condition on an existing column
I have a data frame as follows:
import pandas as pd

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])
  name  count
0    a      1
1    b      1
2    c      0
3    a      1
4    c      1
5    a      0
6    b      1
7    c      1
8    a      0
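For each element in the "name" column, I am trying to find the ratio of the number of 0s to the total number of 0s and 1s.
First, I combine the counts as follows: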
# zero_one_frequencies holds the per-name counts of 0 and 1 (columns 'name', 0, 1)
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO ratios for {j} = {zero_pb}")
    print(f"One ratios for {j} = {one_pb}")
    print("="*30)
The output looks like this:
a
ZERO ratios for a = 0 0.5
dtype: float64
One ratios for a = 0 0.5
dtype: float64
==============================
b
ZERO ratios for b = 1 0.0
dtype: float64
One ratios for b = 1 1.0
dtype: float64
==============================
c
ZERO ratios for c = 2 0.333333
dtype: float64
One ratios for c = 2 0.666667
dtype: float64
==============================
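My goal is to add two new columns to the data frame, "name_0" and "name_1", holding the ratio value for each element of the "name" column. I tried the following, but it did not give the expected result: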
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO probability for {j} = {zero_pb}")
    print(f"One probability for {j} = {one_pb}")
    print("="*30)
    condition1 = [ df2['name'].eq(j) & df2['count'].eq(0)]
    condition2 = [ df2['name'].eq(j) & df2['count'].eq(1)]
    choice1 = zero_pb.tolist()
    choice2 = one_pb.tolist()
    print(f'choice1 = {choice1}, choice2 = {choice2}')
    # Each pass through the loop reassigns the whole column
    df2["name"+str("_0")] = np.select(condition1, choice1, default=0)
    df2["name"+str("_1")] = np.select(condition2, choice2, default=0)
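The columns end up holding only the values for the name element "c". That is expected, because the values computed in the last iteration overwrite everything written before.
Can you help me understand whether there is another, effective way to use np.select?
Expected output: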
  name  count    name_0    name_1
0    a      1  0.000000  0.500000
1    b      1  0.000000  1.000000
2    c      0  0.333333  0.000000
3    a      1  0.000000  0.500000
4    c      1  0.000000  0.666667
5    a      0  0.500000  0.000000
6    b      1  0.000000  1.000000
7    c      1  0.000000  0.666667
8    a      0  0.500000  0.000000
The code below fixes the problem. However, I cannot find a way to get the same result with numpy.select:
df2["name"+str("_0")] = 0.0
df2["name"+str("_1")] = 0.0
for j in df2.name.unique():
    print(j)
    zero_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0]
    full_ct = zero_one_frequencies[zero_one_frequencies['name'] == j][0] + zero_one_frequencies[zero_one_frequencies['name'] == j][1]
    zero_pb = zero_ct / full_ct
    one_pb = 1 - zero_pb
    print(f"ZERO probability for {j} = {zero_pb.tolist()[0]}")
    print(f"One probability for {j} = {one_pb.tolist()[0]}")
    print("="*30)
    # Write the ratio row by row, only into the column matching the row's count
    for idx in df2[df2['name']== j ].index:
        print("Index:::", idx)
        if df2['count'].iloc[idx] == 0:
            df2.at[idx, "name"+str("_0")] = zero_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {df2["count"].iloc[idx]}')
            print('printing name_0: ', df2["name"+str("_0")].iloc[idx])
            print("*"*30)
        elif df2['count'].iloc[idx] == 1:
            df2.at[idx, "name"+str("_1")] = one_pb.tolist()[0]
            print(f'Count for {j} at index {idx} is {df2["count"].iloc[idx]}')
            print('printing name_1: ', df2["name"+str("_1")].iloc[idx])
            print("*"*30)
I don't have access to the zero_one_frequencies df, so I took the liberty of solving this my own way:
import pandas as pd
import numpy as np

col1 = ['a','b','c','a','c','a','b','c','a']
col2 = [1,1,0,1,1,0,1,1,0]
df2 = pd.DataFrame(zip(col1,col2),columns=['name','count'])
# Start from float columns so the ratios are not written into an integer dtype
df2["name_0"] = 0.0
df2["name_1"] = 0.0
for name in df2['name'].unique():
    df_name = df2[df2['name'] == name]
    # Share of 1s among the rows with this name
    prob_1 = df_name['count'].mean()
    for count in df2['count'].unique():
        mask = (df2['name'] == name) & (df2['count'] == count)
        # count == 0 gives 1 - prob_1, count == 1 gives prob_1
        df2.loc[mask, "name_" + str(count)] = np.abs(((count + 1) % 2) - prob_1)
Output:
  name  count    name_0    name_1
0    a      1  0.000000  0.500000
1    b      1  0.000000  1.000000
2    c      0  0.333333  0.000000
3    a      1  0.000000  0.500000
4    c      1  0.000000  0.666667
5    a      0  0.500000  0.000000
6    b      1  0.000000  1.000000
7    c      1  0.000000  0.666667
8    a      0  0.500000  0.000000
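On the original question about np.select: one possible approach, sketched here under the assumption that the per-name ratio can be computed with groupby(...).transform('mean') instead of zero_one_frequencies, is to call np.select once per column outside any loop. The names one_pb and zero_pb are reused from the question purely for illustration. Because the choices are full-length Series, each row picks up the ratio of its own name and nothing is overwritten.

```python
import numpy as np
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

# Share of 1s within each name, broadcast back to every row of that name
one_pb = df2.groupby('name')['count'].transform('mean')
zero_pb = 1 - one_pb

# One np.select call per column for the whole frame; rows that do not
# satisfy the condition fall back to the default of 0.0
df2['name_0'] = np.select([df2['count'].eq(0)], [zero_pb], default=0.0)
df2['name_1'] = np.select([df2['count'].eq(1)], [one_pb], default=0.0)
print(df2)
```

With a single condition per column, np.where(df2['count'].eq(0), zero_pb, 0.0) would behave the same way; np.select becomes more useful when several conditions feed one column.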
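To understand np.select, I suggest looking at [link].
Please post your expected output based on df2.
Hi Mayank: I edited my post for better clarification.
Regarding clean code, @Oddaspa, please see the link. It looks much cleaner than mine :)
`zero_one_frequencies = pd.crosstab(df2['name'], df2['count']).reset_index().rename(columns={'index': 'count'}).rename_axis(None, axis='columns')` This is the zero_one_frequencies code I used.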
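For readers who want to reproduce the zero_one_frequencies idea from the last comment, here is a minimal sketch. It assumes that passing normalize='index' to pd.crosstab is an acceptable substitute for dividing the counts by hand; the name ratios is hypothetical and does not appear in the thread.

```python
import numpy as np
import pandas as pd

col1 = ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'c', 'a']
col2 = [1, 1, 0, 1, 1, 0, 1, 1, 0]
df2 = pd.DataFrame(zip(col1, col2), columns=['name', 'count'])

# Row-normalised crosstab: one row per name, columns 0 and 1 hold the shares
ratios = pd.crosstab(df2['name'], df2['count'], normalize='index')
print(ratios)

# Look up each row's ratio by name, then keep it only where the row's
# count matches the column; all other rows get 0.0
df2['name_0'] = df2['name'].map(ratios[0]).where(df2['count'].eq(0), 0.0)
df2['name_1'] = df2['name'].map(ratios[1]).where(df2['count'].eq(1), 0.0)
print(df2)
```

The crosstab stays indexed by name here (no reset_index), which is what lets Series.map look the ratio up directly.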