How to group by the values of a list in a pandas DataFrame column
I have a pandas DataFrame of movies, like this:
id, name, genre, release_year
1 A [a,b,c] 2017
2 B [b,c] 2017
3 C [a,c] 2010
4 D [d,c] 2010
....
I want to group the movies by the values in the genre list.
My expected output is:
year, genre, number_of_movies
2017 a 1
2017 b 2
2017 c 2
2010 a 1
2010 c 2
...
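Can someone help me achieve this?
You can create a new DataFrame through the constructor, reshape it so that each genre gets its own row, and then count the rows in each group.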
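The answer above names no specific methods, so the following is only a sketch of one way to read it. It assumes `stack` for the reshape and `groupby(...).size()` for the count; `wide`, `long` and `counts` are illustrative names, and the sample frame mirrors the one in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

# Constructor step: one row per movie, one column per list position.
# Shorter lists are padded with missing values.
wide = pd.DataFrame(df['genre'].tolist(),
                    index=df['release_year'].to_numpy()).rename_axis('release_year')

# Reshape step: move the columns into rows, then drop the padding explicitly
# so the result does not depend on how a given pandas version handles it.
long = (
    wide.stack()
        .dropna()
        .reset_index(level=1, drop=True)   # discard the list-position level
        .rename('genre')
        .reset_index()
)

# Count step: number of movies per (release_year, genre).
counts = (
    long.groupby(['release_year', 'genre'])
        .size()
        .reset_index(name='number_of_movies')
)
print(counts)
```

Padding with missing values keeps this version short, but it builds a wide intermediate frame, so it may be wasteful when a few movies have very long genre lists.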
For better performance, use itertools.chain to flatten the genre column:
from itertools import chain
import pandas as pd

# Flatten the genre lists and repeat each release_year once per genre in its list
df = pd.DataFrame({
    'genre': list(chain.from_iterable(df.genre.tolist())),
    'release_year': df.release_year.repeat(df.genre.str.len())
})
df
genre release_year
0 a 2017
0 b 2017
0 c 2017
1 b 2017
1 c 2017
2 a 2010
2 c 2010
3 d 2010
3 c 2010
Now group by genre and release_year and find the size of each group:
(
    df.groupby(['genre', 'release_year'], sort=False)
      .size()
      .reset_index(name='number_of_movies')
)
genre release_year number_of_movies
0 a 2017 1
1 b 2017 2
2 c 2017 2
3 a 2010 1
4 c 2010 2
5 d 2010 1
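On pandas 0.25 or later, `DataFrame.explode` can do the flattening step in a single call. This is a different technique from the `itertools.chain` approach above, shown only as a sketch on the sample data from the question; `counts` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

# explode() emits one row per list element and repeats the other columns.
counts = (
    df.explode('genre')
      .groupby(['release_year', 'genre'], sort=False)
      .size()
      .reset_index(name='number_of_movies')
)
print(counts)
```

With `sort=False` the groups appear in order of first occurrence; drop that argument if sorted output is preferred.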
Another neat approach is to use Counter, i.e.:
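from collections import Counter
import numpy as np
# df is the original frame, with lists in the genre column
ndf = df.groupby('release_year')['genre'].apply(lambda x: Counter(np.concatenate(x.values))).reset_index()
# set_axis no longer accepts inplace in pandas 2.0+, so it is omitted here
ndf = ndf.set_axis(['release_year', 'genre', 'number_of_movies'], axis=1)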
Output:
release_year genre number_of_movies
0 2010 a 1.0
1 2010 c 2.0
2 2010 d 1.0
3 2017 a 1.0
4 2017 b 2.0
5 2017 c 2.0
Here is a collections.Counter approach that has O(n) complexity and needs neither df.groupby nor df.apply:
from collections import Counter
from itertools import product, chain
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})
# Pair each movie's year with each of its genres, then count the pairs
c = Counter(chain.from_iterable([list(product([x['year']], x['genre']))
                                 for idx, x in df.iterrows()]))
# Counter({(2010, 'a'): 1,
# (2010, 'c'): 2,
# (2010, 'd'): 1,
# (2017, 'a'): 1,
# (2017, 'b'): 2,
# (2017, 'c'): 2})
df = pd.DataFrame.from_dict(c, orient='index')
# 0
# (2017, a) 1
# (2017, b) 2
# (2017, c) 2
# (2010, a) 1
# (2010, c) 2
# (2010, d) 1
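If separate columns are preferred over a (year, genre) index, the Counter can be unpacked into rows directly. A minimal sketch, assuming the counts shown above; `tidy` and the column names are illustrative choices taken from the question's expected output.

```python
from collections import Counter

import pandas as pd

# Same counts as produced by the Counter approach above.
c = Counter({(2010, 'a'): 1, (2010, 'c'): 2, (2010, 'd'): 1,
             (2017, 'a'): 1, (2017, 'b'): 2, (2017, 'c'): 2})

# Unpack each (year, genre) key and its count into one row.
tidy = pd.DataFrame(
    [(year, genre, n) for (year, genre), n in c.items()],
    columns=['year', 'genre', 'number_of_movies'],
)
print(tidy)
```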
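Have you tried anything?
Thanks for the answer. I tried the solution. However, my genres look like ['Adventure', 'Comedy', ...], and after the first step I get output like this: (2016,[):33,(2016,“”):206,(2016,'A'):40,(2016,'d'):28,(2016,'v'):20,(2016,'e'):83,(2016,'n'):70,(2016,'t'):61,(2016,'u'):21,(2016,'r'):59,(2016,,'):70,(2016,,):78,…. The expected output is (2016, 'Adventure'): 40, ...
@Sitabja, I'm confused. With your input I get your expected output (apart from using (year, genre) as the index)... It looks like either the logic was misapplied or your data is different.
It's working. There was actually a problem with my input, which I was able to fix. Thanks a lot for the quick response.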
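The thread does not say what the input problem was. Per-character counts like those above are what you would see if the genre column held strings such as "['Adventure', 'Comedy']" rather than real lists, which can happen after a round trip through CSV. Assuming that was the cause, a sketch of the fix is to parse the strings with `ast.literal_eval` first; the two sample rows below are made up for illustration.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical data: genre is stored as text, not as a list.
df = pd.DataFrame({
    'genre': ["['Adventure', 'Comedy']", "['Adventure']"],
    'year': [2016, 2016],
})

# Iterating over a string yields single characters, which explains
# keys such as (2016, 'A') or (2016, '[').
chars = Counter((2016, ch) for ch in df.loc[0, 'genre'])

# Parse each string into a real Python list before counting.
df['genre'] = df['genre'].apply(ast.literal_eval)

c = Counter((row.year, g)
            for row in df.itertuples(index=False)
            for g in row.genre)
print(c)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data read from a file.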