Python: trying to filter a column in a DataFrame by applying conditions on other columns (Python, Pandas, Pandas Groupby)

python pandas

I have three columns in a CSV file: account_id, game_variant and no_of_games. The table looks like this:


account_id    game_variant   no_of_games
130               a             2
145               c             1
130               b             4
130               c             1
142               a             3
140               c             2
145               b             5

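So I want to extract the number of games played in variants a, b, c, a∩b, b∩c, a∩c and a∩b∩c.

I was able to extract the games played in a, b and c individually by grouping on game_variant and summing no_of_games, but I cannot work out the logic for the intersections. Please help.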

data_agg = df.groupby(['game_variant']).agg({'no_of_games':[np.sum]})
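For reference, the per-variant totals can be reproduced without the CSV file. This is only a minimal sketch: it assumes the table above is typed in by hand as a DataFrame, and the name per_variant is made up for illustration.

```python
import pandas as pd

# Sample data copied by hand from the table in the question (an assumption:
# the real data lives in a CSV file).
df = pd.DataFrame({
    "account_id":   [130, 145, 130, 130, 142, 140, 145],
    "game_variant": ["a", "c", "b", "c", "a", "c", "b"],
    "no_of_games":  [2, 1, 4, 1, 3, 2, 5],
})

# Total number of games per variant, summed over all accounts.
per_variant = df.groupby("game_variant")["no_of_games"].sum()
print(per_variant.to_dict())  # {'a': 5, 'b': 9, 'c': 4}
```

This gives the totals for a, b and c only; it says nothing about which account played which combination of variants.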

Thanks in advance.

The solution here returns the intersections on a per-player level. It also uses defaultdict, which is very convenient in this case. The code is explained inline.

from itertools import combinations
import pandas
from collections import defaultdict
from pprint import pprint  # only needed for pretty printing of dictionary

df = pandas.read_csv('df.csv', sep=r'\s+')  # assuming the data frame is in a file df.csv; raw string avoids an invalid-escape warning

# group by account_id to get subframes which only refer to one account.
data_agg2 = df.groupby(['account_id'])

# a defaultdict is a dictionary, where when no key is present, the function defined
# is used to create the element. This eliminates the check, if a key is
# already present or to set all combinations in advance.
games_played_2 = defaultdict(int)

# iterate over all accounts
for el in data_agg2.groups:
    # extract the sub-dataframe from the grouped object
    tmp = data_agg2.get_group(el)
    # print(tmp)  # you can uncomment this to see each account
    
    # This is in principle the same loop as suggested before. However, as not every
    # player has played all variants, one only has to create the number of combinations
    # necessary for that player
    for i in range(len(tmp.loc[:, 'no_of_games'])):
        # As now the game_variant is a column and not the index, the first part of zip
        # is slightly adapted. This loops over all combinations of variants for the
        # current account.
        for comb, combsum in zip(combinations(tmp.loc[:, 'game_variant'], i+1), combinations(tmp.loc[:, 'no_of_games'].values, i+1)):
            # Here, each variant combination gets a unique key. Comb is sorted, as the
            # variants might be not in alphabetic order. The number of games played for
            # each variant for that player are added to the value of all players before.
            games_played_2['_'.join(sorted(comb))] += sum(combsum)

pprint(games_played_2)

# returns
>> defaultdict(<class 'int'>,
            {'a': 5,
             'a_b': 6,
             'a_b_c': 7,
             'a_c': 3,
             'b': 9,
             'b_c': 11,
             'c': 4})
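The same per-account idea can also be written so that it runs without df.csv. The following is a sketch under assumptions, not a replacement for the code above: the data is typed in from the question's table, the helper name games_per_combination is hypothetical, and rows are summed per variant first in case an account lists the same variant twice.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

# Sample data typed in from the question's table (assumption: no CSV needed).
df = pd.DataFrame({
    "account_id":   [130, 145, 130, 130, 142, 140, 145],
    "game_variant": ["a", "c", "b", "c", "a", "c", "b"],
    "no_of_games":  [2, 1, 4, 1, 3, 2, 5],
})


def games_per_combination(frame):
    """Hypothetical helper: sum games per variant combination, account by account."""
    totals = defaultdict(int)
    for _, sub in frame.groupby("account_id"):
        # Games per variant for this one account.
        played = sub.groupby("game_variant")["no_of_games"].sum()
        variants = sorted(played.index)
        # Only combinations of variants this account actually played are counted.
        for size in range(1, len(variants) + 1):
            for comb in combinations(variants, size):
                totals["_".join(comb)] += int(played.loc[list(comb)].sum())
    return dict(totals)


result = games_per_combination(df)
print(result)
```

Because combinations are built per account, an account contributes to a_b only if it played both a and b, which is what the intersection in the question asks for.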

>> {'a': array([5], dtype=int64),
>>  'a_b': array([14], dtype=int64),
>>  'a_b_c': array([18], dtype=int64),
>>  'a_c': array([9], dtype=int64),
>>  'b': array([9], dtype=int64),
>>  'b_c': array([13], dtype=int64),
>>  'c': array([4], dtype=int64)}

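The explanation that follows refers to the earlier version of this answer, which worked on the aggregated data (data_agg) rather than on per-account groups.

combinations(sequence, number) returns an iterator over all combinations of number elements from sequence. So, to get all possible combinations, you have to iterate over all values of number from 1 to len(sequence).

The next for loop runs over two iterators in parallel: one over the index of the aggregated data (combinations(data_agg.index, i+1)) and one over the actual number of games played in each variant (combinations(data_agg.loc[:, 'no_of_games'].values, i+1)). So comb should always be a tuple of variants, and combsum a tuple of the number of games played for each of those variants. Note that to get all values you have to use .loc[:, 'no_of_games'] and not .loc['no_of_games'], because the latter looks for an index label named 'no_of_games', whereas it is a column name.

I then set the key of the dictionary to the joined string of the variants in comb, and the value to the sum of the games played in combsum.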
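The code of that earlier, aggregated version is not shown here, so the following is only a reconstruction from the description, with the sample data typed in from the question's table. It also shows the difference between .loc[:, 'no_of_games'] and .loc['no_of_games'].

```python
from itertools import combinations

import pandas as pd

# Sample data typed in from the question's table (assumption).
df = pd.DataFrame({
    "account_id":   [130, 145, 130, 130, 142, 140, 145],
    "game_variant": ["a", "c", "b", "c", "a", "c", "b"],
    "no_of_games":  [2, 1, 4, 1, 3, 2, 5],
})

# Aggregate over all accounts: a DataFrame indexed by game_variant.
data_agg = df.groupby("game_variant")[["no_of_games"]].sum()

games_played = {}
for i in range(len(data_agg)):
    # Walk the variant names and their totals in lockstep, i + 1 at a time.
    for comb, combsum in zip(
        combinations(data_agg.index, i + 1),
        combinations(data_agg.loc[:, "no_of_games"].values, i + 1),
    ):
        games_played["_".join(comb)] = int(sum(combsum))

print(games_played)

# .loc['no_of_games'] looks for a row label, and there is none with that name.
try:
    data_agg.loc["no_of_games"]
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True
print(row_lookup_failed)  # True
```

Note that this version simply adds up the per-variant totals, so a_b is the sum of all games in a and all games in b, regardless of whether the same account played both.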

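Hi @Jakob, I get this error on the line games_played[''.join(comb)] = sum(combsum): TypeError: sequence item 0: expected str instance, tuple found. What should I do to fix it? Could you also explain your code? I did not fully understand it. And why do you use the aggregated data to get the intersections? I mean, how do you know who played variants a and b, or some other combination, without the account_id?

Ah, OK, you want to know the number of games an account played in a combination of game variants? That was not clear to me. Then you do need a different approach. I will add some explanation in a second.

I am not sure why you get this error. If I read in the data frame you posted above (which I copied and saved in the file df.csv), it works for me. Maybe the index of your data_agg is not 'game_variants'. Could you add the result of print(data_agg.index) to your post above?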
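One way this TypeError can arise, offered as a guess rather than a diagnosis of the commenter's data, is an index whose entries are tuples, for example after grouping by two columns. The sketch below uses the question's sample table typed in by hand; the names by_variant and by_account_variant are made up for illustration.

```python
import pandas as pd

# Sample data typed in from the question's table (assumption).
df = pd.DataFrame({
    "account_id":   [130, 145, 130, 130, 142, 140, 145],
    "game_variant": ["a", "c", "b", "c", "a", "c", "b"],
    "no_of_games":  [2, 1, 4, 1, 3, 2, 5],
})

# Grouping by one column gives an index of plain strings.
by_variant = df.groupby("game_variant")["no_of_games"].sum()
print(list(by_variant.index))  # ['a', 'b', 'c']

# Grouping by two columns gives a MultiIndex whose entries are tuples.
by_account_variant = df.groupby(["account_id", "game_variant"])["no_of_games"].sum()
first_entry = by_account_variant.index[0]
print(type(first_entry))

# str.join only accepts strings, so joining those entries fails.
try:
    "".join(list(by_account_variant.index)[:2])
    message = ""
except TypeError as exc:
    message = str(exc)
print(message)
```

Printing data_agg.index, as asked above, shows directly whether the entries are strings or tuples.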
