Python: group a DataFrame and select the most common value

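I have a data frame with three string columns. I know that only one value in the third column is valid for each combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code: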

import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

print(source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0]))

The last line does not work: it raises KeyError: 'Short name', and if I try to group only by City I get an AssertionError. What can I do to fix it?

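With agg, the lambda function receives a Series, which has no 'Short name' attribute.
stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in that tuple.
(This matches the SciPy releases current when this was written. Recent SciPy versions changed what stats.mode returns and no longer accept non-numeric data, so for string columns the pandas-only answers below are the safer choice.)
With these two simple changes: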
source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns
                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
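
To see why `x['Short name']` fails, it can help to look at what `agg` actually passes to the lambda. The following is a minimal sketch (the `kinds` variable is only illustrative), assuming a reasonably recent pandas: each non-key column arrives as its own Series, so there is no 'Short name' label to look up inside it.

```python
import pandas as pd

source = pd.DataFrame({'Country': ['USA', 'USA', 'Russia', 'USA'],
                       'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name': ['NY', 'New', 'Spb', 'NY']})

# Record the type of the object that agg hands to the function for each group.
kinds = source.groupby(['Country', 'City']).agg(lambda x: type(x).__name__)
print(kinds)
```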

You can use value_counts() to get a Series of counts and take its first row:
import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
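
As a small illustration of what the lambda does for a single group, here is a sketch on a stand-alone Series (the names `s`, `counts` and `most_common` are illustrative, not from the answer): `value_counts()` orders the distinct values by descending count, so the first index label is the most frequent value.

```python
import pandas as pd

s = pd.Series(['NY', 'New', 'NY'])

counts = s.value_counts()      # counts per distinct value, highest count first
most_common = counts.index[0]  # label of the first (most frequent) row
print(most_common)
```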

If you want to run other aggregation functions in the same .agg() call, try this:
# Let's add a new col,  account
source['account'] = [1,2,3,3]

source.groupby(['Country','City']).agg(
    mod=('Short name', lambda x: x.value_counts().index[0]),
    avg=('account', 'mean'),
)

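I'm a little late to the game here, but I ran into performance problems with HYRY's solution, so I had to come up with another one.
It works by finding the frequency of each key-value pair and then, for each key, keeping only the value that appears most often.
There is also an additional solution that supports multiple modes.
On a scale test representative of the data I work with, this reduced the runtime from 37.4s to 0.5s.
Here is the code for the solution, some example usage, and the scale test: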
import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the mode.

    The output is a DataFrame with a record per group that has at least one mode
    (null values are not counted). The `key_cols` are included as columns, `value_col`
    contains a mode (ties are broken arbitrarily and deterministically) for each
    group, and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''
    Pandas does not provide a `mode` aggregation function
    for its `GroupBy` objects. This function is meant to fill
    that gap, though the semantics are not exactly the same.

    The input is a DataFrame with the columns `key_cols`
    that you would like to group on, and the column
    `value_col` for which you would like to obtain the modes.

    The output is a DataFrame with a record per group that has at least
    one mode (null values are not counted). The `key_cols` are included as
    columns, `value_col` contains lists indicating the modes for each group,
    and `count_col` indicates how many times each mode appeared in its group.
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print(test_input)
print(mode(test_input, ['key'], 'value', 'count'))
print(modes(test_input, ['key'], 'value', 'count'))

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

# Time the single-mode helper.
start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

# Time the multi-mode helper.
start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print(time.time() - start)

# Time the value_counts() approach for comparison.
start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print(time.time() - start)

Running this code prints something like:
   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

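Hope this helps.

Formally, the correct answer is @eumiro's solution.
The problem with @HYRY's solution is that when you have a sequence of numbers such as [1,2,3,4], the result is wrong, i.e., you do not get the mode.
For example, if you compute it the way @HYRY does, you get: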
>>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
        total  bla
client            
A           4   30
B           4   40
C           1   10
D           3   30
E           2   20

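This is clearly wrong (see the value for A, which should be 1 and not 4), because it cannot handle values that are all unique.
So the other solution is the correct one: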
>>> import scipy.stats
>>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))
        total  bla
client            
A           1   10
B           4   40
C           1   10
D           3   30
E           2   20
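
The example frame itself is not shown above, so here is a small sketch on an illustrative Series (not the author's data) in which every value occurs once, so every value is a mode. `Series.mode` returns the tied values in sorted order, which makes taking the first one reproducible. Which tied value `value_counts().index[0]` picks is deliberately left unchecked here, since it depends on how ties are ordered.

```python
import pandas as pd

values = pd.Series([4, 1, 3, 2])  # illustrative data: every value occurs once

modes = values.mode()             # all tied values, returned in sorted order
smallest_mode = modes.iloc[0]     # a deterministic pick among the ties
print(modes.tolist(), smallest_mode)
```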

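For larger datasets, a slightly clumsier but faster approach is to count the occurrences in the column of interest, sort the counts from highest to lowest, and then de-duplicate on the grouping columns so that only the most frequent case is kept. A code example follows: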
>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb

The problem is performance: if you have a lot of rows, it becomes an issue.
If that is your case, try this:
import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

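# value_counts() orders each group by descending count, so the first row per
# group is its most common value. head(1) keeps the 'Short_name' index level
# (the values are the counts), whereas first() would return only the count.
source.groupby(['Country','City']).Short_name.value_counts().groupby(level=['Country','City']).head(1)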

If you want another approach that does not depend on value_counts or scipy.stats, you can use Counter from collections:
from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

It can be applied to the example above like this:
src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)

pandas >= 0.16
pd.Series.mode is available!
Use groupby with agg, and apply pd.Series.mode to each group:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If you need it as a DataFrame, use
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

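The useful thing about Series.mode is that it always returns a Series, which makes it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.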
# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


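Handling multiple modes
Series.mode also does a good job when there are multiple modes. If you want a separate row for each mode, you can use GroupBy.apply instead of agg.
If you do not care which mode is returned, as long as it is one of them, you need a lambda that calls mode and extracts the first result: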
source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
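
`source2` is not defined in the text above. As a minimal sketch, assume it is `source` plus one extra 'New' row, so that 'NY' and 'New' tie for USA / New-York (this construction is an assumption, not necessarily the author's). With that frame, `apply` yields one row per mode and the lambda picks a single mode per group.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

# Hypothetical source2: one extra 'New' row so that 'NY' and 'New' occur twice each.
extra = pd.DataFrame({'Country': ['USA'], 'City': ['New-York'], 'Short name': ['New']})
source2 = pd.concat([source, extra], ignore_index=True)

grouped = source2.groupby(['Country', 'City'])['Short name']

# One row per mode: apply() keeps every value that Series.mode returns.
all_modes = grouped.apply(pd.Series.mode)

# One value per group: take the first of the modes.
first_mode = grouped.agg(lambda x: x.mode().iloc[0])

print(all_modes)
print(first_mode)
```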


Alternatives (not) considered
You could also use statistics.mode from Python's standard library, but...
import statistics
source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

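...it does not work well when it has to deal with multiple modes; a StatisticsError is raised. This is mentioned in the docs:
If data is empty, or if there is not exactly one most common value, StatisticsError is raised.
But you can see for yourself (on Python versions before 3.8; from 3.8 on, statistics.mode returns the first mode encountered instead of raising):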
statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
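
On Python 3.8 and later the standard library also provides `statistics.multimode`, which returns every most common value as a list instead of raising. A minimal sketch (the variable names are illustrative):

```python
import statistics

tied = statistics.multimode([1, 2])       # both values occur once, so both are modes
single = statistics.multimode([1, 1, 2])  # 1 is the only mode
print(tied, single)
```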

The two top answers here suggest:
df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably
df.groupby(cols).agg(pd.Series.mode)

However, both of these fail in simple edge cases, as demonstrated here:
df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]
})

The first:
df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

raises an IndexError (because group C yields an empty Series). The second:
df.groupby(['client_id', 'date']).agg(pd.Series.mode)

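raises ValueError: Function does not reduce, because the first group returns a list of two values (since there are two modes). (As documented, this would work if the first group returned a single mode!)
Two possible solutions for this case are:
import scipy.stats
df.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])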

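and the solution cs95 gave me in the comments (the foo function listed further down).
However, all of these are slow and not suited to large datasets. The solution I ended up using, which a) can handle these cases and b) is much faster, is a slightly modified version of abw33's answer (which should be higher up); it is the get_mode_per_column function listed further down.
Essentially, the method works on one column at a time and outputs a df, so you take the result for the first column as a df and then iteratively add the output array (values.flatten()) for each remaining column as a new column of that df.
If you do not want to include NaN values, using Counter is much faster than pd.Series.mode or pd.Series.value_counts()[0]: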
from collections import Counter

def get_most_common(srs):
    x = list(srs)
    my_counter = Counter(x)
    return my_counter.most_common(1)[0][0]

df.groupby(col).agg(get_most_common)

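should work. This will fail when you have NaN values, because each NaN is counted separately.

@ViacheslavNefedov - yes, but take @HYRY's solution, which uses pure pandas. There is no need for scipy.stats.
I found that stats.mode can give wrong answers for string variables. This approach looks more reliable.
Shouldn't this be .value_counts(ascending=False)?
@Private: ascending=False is already the default, so there is no need to set the order explicitly.
As Jacquot said, pd.Series.mode is now more appropriate and faster.
I get an error, IndexError: index 0 is out of bounds for axis 0 with size 0. How do I solve it?
This is the fastest method I have come across. Thanks!
Is there a way to use this approach directly inside the agg parameters?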

def foo(x):
    m = pd.Series.mode(x)
    # Fall back to NaN when the group has no non-null values (empty mode).
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)
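
As a quick check of `foo`, the sketch below re-creates the edge-case frame from earlier so that it runs on its own. Group A has two modes, and group C has only a NaN. The `result` name is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date': ['2019-01-01'] * 8,
    'location': ['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan],
})

def foo(x):
    m = pd.Series.mode(x)
    # An all-NaN group has an empty mode, so return NaN instead of indexing into it.
    return m.values[0] if not m.empty else np.nan

result = df.groupby(['client_id', 'date']).agg(foo)
print(result)
```

Group A resolves to one of its two tied values, group B to 'DC', and group C to NaN, so neither of the failures described above occurs.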

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.nan))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values
