Python: create a new column holding another column's value, conditional on multiple other columns shared across indices

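I am working in pandas with a dataset of about 118k observations of games played, and each game should have two entries. When I first encounter entry A, I need to find the other observation based on three values in the current observation, and create a new column using the value of another column. Apologies if this does not render correctly on all devices; I am not sure how to format pandas tables on SO, but my data is laid out like this: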

   date | user_a_id | user_b_id | a_points | b_points | b_wins | a_result
0  12.1     20834     65168         65165      10568      5         W
1  12.1     20834     84163         65165      88452     21         W
2  12.2     20834     61806         65165      25998     19         L
3  12.1     84163     20834         88452      65165     33         L
4  12.3     96844     10196         22609      167005    52         W
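
For anyone who wants to experiment, the sample above can be rebuilt as a DataFrame. This is only a sketch: the dtypes (float dates, integer ids) are an assumption read off the printed table, not something the question states.

```python
import pandas as pd

# Sample rows copied from the table above; dtypes are assumed.
df = pd.DataFrame(
    {
        "date": [12.1, 12.1, 12.2, 12.1, 12.3],
        "user_a_id": [20834, 20834, 20834, 84163, 96844],
        "user_b_id": [65168, 84163, 61806, 20834, 10196],
        "a_points": [65165, 65165, 65165, 88452, 22609],
        "b_points": [10568, 88452, 25998, 65165, 167005],
        "b_wins": [5, 21, 19, 33, 52],
        "a_result": ["W", "W", "L", "L", "W"],
    }
)
print(df)
```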

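There is a bunch of additional data for each player, but the column I need to build the new column from is b_wins. Each row tells the story of one game, but a_result is the outcome of the game for user a. b_wins is a useful piece of data that tells us how much experience a player had going into the match. I believe it will have high predictive value, so dropping it would be unwise.

In this example, rows 1 and 3 tell the story of the same game. I need the value of df.iloc[3]['b_wins'] to go into a new column named a_wins at df.iloc[1], and vice versa. The two resulting rows would look like this: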

   date | user_a_id | user_b_id | a_points | b_points | b_wins | a_result | a_wins
1  12.1     20834     84163         65165      88452     21         W         33
3  12.1     84163     20834         88452      65165     33         L         21

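Some notes about the data:

Not every game has a pair. The data was scraped from a website and is very messy. It is possible that there is only one observation for a game, and that is fine.
There is no game ID, so I can only match on the date and the swapped user ID numbers.
There are a lot of rematches, so matching on the swapped ID numbers alone is not enough, and I have not yet managed to filter them by date as well.
Most of my work so far has been done in a Colab notebook. I first tried a plain Python shell, with no luck.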

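What I have tried:

df['a_wins'] = df['user_a_id'].apply(lambda x: df.loc[df["user_b_id"] == x, "b_wins"].values)

This approach only seems to work sporadically. I do not get all of the values, and it does not handle rematches. To try filtering by date, I tried: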

for i in df['date']:
  grouped = df.groupby('date').get_group(i)
  df['a_wins'] = grouped['user_a_id'].apply(lambda x: grouped.loc[grouped["user_b_id"] == x, "b_wins"].values)

This also only works sporadically. Both take forever! :)
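
A small reproduction may help show why the first attempt gives patchy results. The sketch below runs the same lookup on the five sample rows (only the columns the lookup touches are rebuilt here). Because the lookup matches on user_a_id alone, every row for player 20834 receives the b_wins of the one game where 20834 appears as user b, regardless of opponent or date, and rows without a partner get an empty array. Each call also scans the whole frame, so the cost grows roughly with the square of the row count.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": [12.1, 12.1, 12.2, 12.1, 12.3],
        "user_a_id": [20834, 20834, 20834, 84163, 96844],
        "user_b_id": [65168, 84163, 61806, 20834, 10196],
        "b_wins": [5, 21, 19, 33, 52],
    }
)

# Same lookup as the first attempt: only user_a_id == user_b_id is checked,
# so neither the opponent nor the date takes part in the match.
lookup = df["user_a_id"].apply(
    lambda x: df.loc[df["user_b_id"] == x, "b_wins"].values
)

# Convert each NumPy array to a plain list for easier inspection.
matches = [v.tolist() for v in lookup]
print(matches)
```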

Create the missing columns:

# initialise a_wins, b_result
df['a_wins'] = None
df['b_result'] = df['a_result'].replace({'W':'L','L':'W'})

The idea is to swap the contents so that the smaller id is always in user_a_id:

# which values to swap
df['swap'] = df['user_a_id'] > df['user_b_id']

Create lists of the corresponding column names:

# works for the data you posted, might want to adjust.
a_list = sorted([a for a in df.columns if 'a_' in a])
b_list = sorted([b for b in df.columns if 'b_' in b])

Swap the contents wherever the swap condition holds:

for a, b in zip(a_list, b_list):
    df.loc[df['swap'], a], df.loc[df['swap'], b] = df[df['swap']][b], df[df['swap']][a]

Output:

   date  user_a_id  user_b_id  a_points  b_points b_wins a_result   swap a_wins b_result
0  12.1      20834      65168     65165     10568      5        W  False   None        L
1  12.1      20834      84163     65165     88452     21        W  False   None        L
2  12.2      20834      61806     65165     25998     19        L  False   None        W
3  12.1      20834      84163     65165     88452   None        W   True     33        L
4  12.3      10196      96844    167005     22609   None        L   True     52        W
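
As an aside, if the only goal is to recognise rows that describe the same game, the columns do not have to be physically swapped. The sketch below is a different technique from the loop above: it builds an order-independent pair key with np.minimum and np.maximum. The column names lo and hi are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date": [12.1, 12.1, 12.2, 12.1, 12.3],
        "user_a_id": [20834, 20834, 20834, 84163, 96844],
        "user_b_id": [65168, 84163, 61806, 20834, 10196],
        "b_wins": [5, 21, 19, 33, 52],
    }
)

# Hypothetical helper columns: the smaller id goes to 'lo', the larger to 'hi',
# so both rows of a game end up with the same (date, lo, hi) triple.
df["lo"] = np.minimum(df["user_a_id"], df["user_b_id"])
df["hi"] = np.maximum(df["user_a_id"], df["user_b_id"])

# keep=False flags every member of a duplicated key, not just the repeats.
same_game = df.duplicated(["date", "lo", "hi"], keep=False)
paired_rows = df[same_game].index.tolist()
print(paired_rows)
```

On the sample data this flags rows 1 and 3. Rematches played on the same date would share a key as well, so this alone cannot tell them apart.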

Edit: the entries can now be duplicated by grouping on date, user_a_id, user_b_id and filling the None values:

df = df.groupby(['date','user_b_id', 'user_a_id'])[df.columns].fillna(method='ffill').fillna(method='bfill')
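
One thing to watch in the line above: the trailing .fillna(method='bfill') is applied to the frame returned by the grouped fill, so it is no longer restricted to a group and can pull values in from unrelated games. As far as I know, fillna(method=...) is also deprecated in recent pandas releases. A minimal sketch of a fill that stays inside each group, assuming the frame is in the swapped state and only the two wins columns need filling:

```python
import numpy as np
import pandas as pd

# Sample rows in the swapped state (smaller id in user_a_id).
df = pd.DataFrame(
    {
        "date": [12.1, 12.1, 12.2, 12.1, 12.3],
        "user_a_id": [20834, 20834, 20834, 20834, 10196],
        "user_b_id": [65168, 84163, 61806, 84163, 96844],
        "a_wins": [np.nan, np.nan, np.nan, 33.0, 52.0],
        "b_wins": [5.0, 21.0, 19.0, np.nan, np.nan],
    }
)

keys = ["date", "user_a_id", "user_b_id"]
wins = ["a_wins", "b_wins"]

# GroupBy.ffill / GroupBy.bfill fill only within each (date, a, b) group,
# so unpaired games keep their missing values.
df[wins] = df.groupby(keys)[wins].ffill()
df[wins] = df.groupby(keys)[wins].bfill()
print(df)
```

With this variant rows 0 and 2 keep a missing a_wins, since no partner row exists for them.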

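Now you can restore the original format using the swap column, that is, by running the swap loop above a second time.

Output: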

   date  user_a_id  user_b_id  a_points  b_points  b_wins a_result  a_wins b_result   swap
0  12.1      20834      65168     65165     10568     5.0        W    33.0        L  False
1  12.1      20834      84163     65165     88452    21.0        W    33.0        L  False
2  12.2      20834      61806     65165     25998    19.0        L    33.0        W  False
3  12.1      84163      20834     88452     65165    33.0        L    21.0        W   True
4  12.3      96844      10196     22609    167005    52.0        W     NaN        L   True

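After a lot of poking around, I found that all this does is switch a => b and b => a, without creating any new data in any observation. Interesting solution though (super fast), so I upvoted it, but it does not solve my problem. b_wins will always have a value, even if there is no matching game in my data frame.

I must have misunderstood something in your question. Thanks for the upvote, though. I have edited it so that it swaps -> fills the missing values -> swaps back. This should be fast enough on a 100k-row data frame.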
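
For what it is worth, another way to get a_wins without swapping at all is a self-merge on the date and the mirrored ids. This is only a sketch under stated assumptions, not the approach from the answer: it assumes the partner row carries exactly the same date, and it keeps a single mirrored row per (date, pair), so rematches played on the same date cannot be told apart (there is no game id to do so).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": [12.1, 12.1, 12.2, 12.1, 12.3],
        "user_a_id": [20834, 20834, 20834, 84163, 96844],
        "user_b_id": [65168, 84163, 61806, 20834, 10196],
        "a_points": [65165, 65165, 65165, 88452, 22609],
        "b_points": [10568, 88452, 25998, 65165, 167005],
        "b_wins": [5, 21, 19, 33, 52],
        "a_result": ["W", "W", "L", "L", "W"],
    }
)

keys = ["date", "user_a_id", "user_b_id"]

# Mirrored view of every row: the ids trade places and the partner's
# b_wins is exposed under the name a_wins.
mirror = (
    df[["date", "user_a_id", "user_b_id", "b_wins"]]
    .rename(
        columns={
            "user_a_id": "user_b_id",
            "user_b_id": "user_a_id",
            "b_wins": "a_wins",
        }
    )
    # Assumption: one mirrored row per key, so the merge never multiplies rows.
    .drop_duplicates(keys)
)

# Left merge: games without a partner row simply get NaN in a_wins.
result = df.merge(mirror, on=keys, how="left")
print(result[["date", "user_a_id", "user_b_id", "b_wins", "a_wins"]])
```

On the sample rows this gives a_wins of 33 for row 1 and 21 for row 3, and leaves the unpaired games empty. A merge like this is vectorised, so it should cope with roughly 118k rows far better than a row-by-row apply.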