Python: How to count substring items using groupby
I start with input data like this:
email country_code
12345kinglobito94@hotmail.com RU
12345arturdyikan6211@gmail.com RU
12345leonardosebastianld.20@gmail.com PE
12345k23156876vs@hotmail.com RU
12345jhuillcag@hotmail.com PE
12345ergasovaskazon72@gmail.com RU
12345myrzadaevajrat@gmail.com RU
12345filomena@hotmail.com BR
12345jppicotajose20@hotmail.com BR
... ...
When printed, it looks like this:
email country_code
0 12345kinglobito94@hotmail.com RU
1 12345arturdyikan6211@gmail.com RU
2 12345leonardosebastianld.20@gmail.com PE
3 12345k23156876vs@hotmail.com RU
4 12345jhuillcag@hotmail.com PE
5 12345ergasovaskazon72@gmail.com RU
6 12345myrzadaevajrat@gmail.com RU
7 12345filomena@hotmail.com BR
8 12345jppicotajose20@hotmail.com BR
... ...
Grouping by country is straightforward:
country_code
AR 21
BR 340
PE 198
RU 402
US 39
Name: email, dtype: int64
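The command that produced the counts above is not shown in the question. Judging by the `Name: email, dtype: int64` footer, it was probably something along these lines. This is only a sketch built from the nine sample rows listed earlier, so its numbers differ from those of the full dataset.

```python
import pandas as pd

# Only the nine sample rows visible in the question; the full dataset is not shown.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Count the non-null emails in each country (assumed form of the original call).
per_country = df.groupby("country_code")["email"].count()
print(per_country)
```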
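But I would like to count how many hotmail and gmail domains there are in each country.
Extract the domain with a regex, then use groupby().size(). If you don't want to add an extra column, you can also do it like this: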
df.groupby(["country_code", df['email'].str.extract(r'@(.*?)\.', expand=False)]).size()
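The variant that stores the domain in its own column first is described above but its code did not survive. A minimal sketch of what it could look like, assuming a new column named `domain` (the name is an assumption) and using only the sample rows from the question:

```python
import pandas as pd

# Only the nine sample rows visible in the question.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Capture the text between "@" and the first dot that follows it.
# "domain" is a hypothetical column name chosen for this sketch.
df["domain"] = df["email"].str.extract(r"@(.*?)\.", expand=False)

# size() counts the rows in each (country_code, domain) group.
counts = df.groupby(["country_code", "domain"]).size()
print(counts)
```

The result is a Series indexed by (country_code, domain). Passing the extracted Series directly to groupby, as in the one-liner above, gives the same grouping without modifying df.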
We can also use str.replace(), but I think @Dark's variant is more idiomatic:
In [17]: (df.assign(domain=df['email'].str.replace(r'.*?@(.*?)\.\w+', r'\1', regex=True))
...: .groupby(['country_code', 'domain'])['email']
...: .count()
...: .reset_index(name='count'))
...:
Out[17]:
country_code domain count
0 BR hotmail 2
1 PE gmail 1
2 PE hotmail 1
3 RU gmail 3
4 RU hotmail 2
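If one row per country with a column per domain is easier to read, the grouped counts can be pivoted. This is not part of the original answers; it is a small sketch on the same sample rows, with `wide` as a hypothetical variable name.

```python
import pandas as pd

# Only the nine sample rows visible in the question.
df = pd.DataFrame({
    "email": [
        "12345kinglobito94@hotmail.com",
        "12345arturdyikan6211@gmail.com",
        "12345leonardosebastianld.20@gmail.com",
        "12345k23156876vs@hotmail.com",
        "12345jhuillcag@hotmail.com",
        "12345ergasovaskazon72@gmail.com",
        "12345myrzadaevajrat@gmail.com",
        "12345filomena@hotmail.com",
        "12345jppicotajose20@hotmail.com",
    ],
    "country_code": ["RU", "RU", "PE", "RU", "PE", "RU", "RU", "BR", "BR"],
})

# Extract the domain, then name the Series so the pivoted columns get a label.
domain = df["email"].str.extract(r"@(.*?)\.", expand=False).rename("domain")

# Count per (country, domain), then move the domain level into columns.
# fill_value=0 covers combinations that never occur, such as BR with gmail here.
wide = df.groupby(["country_code", domain]).size().unstack(fill_value=0)
print(wide)
```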