Python: How to count items by substring with groupby

python pandas

I start with input data like this:

email                               country_code
12345kinglobito94@hotmail.com           RU
12345arturdyikan6211@gmail.com          RU
12345leonardosebastianld.20@gmail.com   PE
12345k23156876vs@hotmail.com            RU
12345jhuillcag@hotmail.com              PE
12345ergasovaskazon72@gmail.com         RU
12345myrzadaevajrat@gmail.com           RU
12345filomena@hotmail.com               BR
12345jppicotajose20@hotmail.com         BR
...                                    ...

When printed, the DataFrame looks like this:

                                      email country_code
0            12345kinglobito94@hotmail.com           RU
1           12345arturdyikan6211@gmail.com           RU
2    12345leonardosebastianld.20@gmail.com           PE
3             12345k23156876vs@hotmail.com           RU
4               12345jhuillcag@hotmail.com           PE
5          12345ergasovaskazon72@gmail.com           RU
6            12345myrzadaevajrat@gmail.com           RU
7                12345filomena@hotmail.com           BR
8          12345jppicotajose20@hotmail.com           BR
...                                                 ...

Grouping by country is straightforward:

country_code
AR     21
BR    340
PE    198
RU    402
US     39
Name: email, dtype: int64
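
A minimal sketch of a grouping that yields output of this shape, assuming the data is held in a DataFrame named df. Only the nine sample rows listed above are used here, so the counts are smaller than in the output shown.

```python
import pandas as pd

# The nine sample rows from the question; the full dataset is not available.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Count the non-null emails in each country; the result is a Series named "email".
counts = df.groupby("country_code")["email"].count()
print(counts)
```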

But I want to count how many hotmail and gmail domains each country has.

Extract the domain with a regex, then use groupby().size().
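
A minimal sketch of this approach on the nine sample rows from the question. The frame name df and the new column name domain are assumptions made for illustration.

```python
import pandas as pd

# The nine sample rows from the question; the full dataset is not available.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Capture the text between "@" and the next "." (for example "hotmail").
# "domain" is a hypothetical column name chosen for this sketch.
df["domain"] = df["email"].str.extract(r"@(.*?)\.", expand=False)

# size() counts the rows in each (country_code, domain) group.
result = df.groupby(["country_code", "domain"]).size()
print(result)
```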

If you would rather not add a column, you can also do this:

df.groupby(["country_code", df['email'].str.extract(r'@(.*?)\.', expand=False)]).size()

We can also use str.replace(), but I think @Dark's variant is more idiomatic:

In [17]: (df.assign(domain=df['email'].str.replace(r'.*?@(.*?)\.\w+', r'\1', regex=True))
    ...:    .groupby(['country_code', 'domain'])['email']
    ...:    .count()
    ...:    .reset_index(name='count'))
    ...:
Out[17]:
  country_code   domain  count
0           BR  hotmail      2
1           PE    gmail      1
2           PE  hotmail      1
3           RU    gmail      3
4           RU  hotmail      2
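
If a wide layout is easier to read, the same counts can be arranged with one row per country and one column per domain. The following is only a sketch on the nine sample rows, using pd.crosstab instead of groupby; the label domain is an assumption chosen here.

```python
import pandas as pd

# The nine sample rows from the question; the full dataset is not available.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Extract the domain without adding a column; "domain" is a hypothetical label.
domain = df["email"].str.extract(r"@(.*?)\.", expand=False).rename("domain")

# Cross-tabulate country against domain; missing combinations are filled with 0.
table = pd.crosstab(df["country_code"], domain)
print(table)
```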
