How to group by the values of a list in a pandas DataFrame column

python python-3.x pandas dataframe

I have a pandas DataFrame of movies like this:

id, name,     genre, release_year 
1    A    [a,b,c]     2017
2    B    [b,c]       2017
3    C    [a,c]       2010
4    D    [d,c]       2010
....

I want to group the movies by the values in the genre list. My expected output is:

year, genre, number_of_movies
2017  a       1
2017  b       2
2017  c       2
2010  a       1
2010  c       2 
...

Can anyone help me achieve this?

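You can create a new DataFrame with the constructor, reshape it so that each genre gets its own row, and then count the rows in each group.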
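The code for this answer is not shown, so the following is only a minimal sketch of the idea. It assumes the reshape step is `DataFrame.stack` and the count is `groupby(...).size()`; the sample data is copied from the question, and the names `wide`, `stacked`, `long_df`, and `counts` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

# Constructor: one row per movie, one column per list position.
# Shorter lists are padded with missing values.
wide = pd.DataFrame(df['genre'].tolist(), index=df.index)

# Reshape: move the columns into the index, giving one entry per
# (movie row, list position), then drop the padding explicitly.
stacked = wide.stack().dropna()

# Look up the release year of the movie each genre entry came from.
movie_rows = stacked.index.get_level_values(0)
long_df = pd.DataFrame({
    'release_year': df['release_year'].loc[movie_rows].to_numpy(),
    'genre': stacked.to_numpy(),
})

# Count: number of movies per (year, genre) pair.
counts = (
    long_df.groupby(['release_year', 'genre'], sort=False)
           .size()
           .reset_index(name='number_of_movies')
)
print(counts)
```

The explicit `dropna()` is there so the padding values are removed regardless of how a given pandas version's `stack` treats missing entries.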
For performance, use itertools.chain to flatten the genre column:
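from itertools import chain

import pandas as pd

# Flatten the genre lists, and repeat each release_year once per genre
# so that both columns end up with the same length.
df = pd.DataFrame({
      'genre' : list(
           chain.from_iterable(df.genre.tolist())
       ), 
      'release_year' : df.release_year.repeat(df.genre.str.len())
})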

df
  genre  release_year
0     a          2017
0     b          2017
0     c          2017
1     b          2017
1     c          2017
2     a          2010
2     c          2010
3     d          2010
3     c          2010

Now group by genre and release_year, and find the size of each group:
df.groupby(
     ['genre', 'release_year'], sort=False
 ).size()\
  .reset_index(name='number_of_movies')

  genre  release_year  number_of_movies
0     a          2017                 1
1     b          2017                 2
2     c          2017                 2
3     a          2010                 1
4     c          2010                 2
5     d          2010                 1
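As a side note, the same flatten-then-count steps can be sketched with `DataFrame.explode`. This is a different technique from the `itertools.chain` approach above and assumes pandas 0.25 or later, where `explode` is available. The sample data is copied from the question; `movies` and `counts` are illustrative names.

```python
import pandas as pd

movies = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['A', 'B', 'C', 'D'],
    'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
    'release_year': [2017, 2017, 2010, 2010],
})

counts = (
    movies.explode('genre')                              # one row per (movie, genre)
          .groupby(['release_year', 'genre'], sort=False)
          .size()                                        # movies per group
          .reset_index(name='number_of_movies')
)
print(counts)
```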

Another neat approach is to use Counter, i.e.
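from collections import Counter

import numpy as np

# Concatenate each year's genre lists and count the genres
ndf = df.groupby('release_year')['genre'].apply(lambda x : Counter(np.concatenate(x.values))).reset_index()

# Rename the columns (set_axis no longer accepts inplace= as of pandas 2.0)
ndf = ndf.set_axis('release_year,genre,number_of_movies'.split(','),axis=1)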

Output:
   release_year genre  number_of_movies
0          2010     a               1.0
1          2010     c               2.0
2          2010     d               1.0
3          2017     a               1.0
4          2017     b               2.0
5          2017     c               2.0

Here is a collections.Counter approach. It has O(n) complexity and does not require df.groupby / df.apply:
from collections import Counter
from itertools import product, chain
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'name': ['A', 'B', 'C', 'D'],
                   'genre': [['a', 'b', 'c'], ['b', 'c'], ['a', 'c'], ['d', 'c']],
                   'year': [2017, 2017, 2010, 2010]})

c = Counter(chain.from_iterable([list(product([x['year']], x['genre'])) \
                                 for idx, x in df.iterrows()]))

# Counter({(2010, 'a'): 1,
#          (2010, 'c'): 2,
#          (2010, 'd'): 1,
#          (2017, 'a'): 1,
#          (2017, 'b'): 2,
#          (2017, 'c'): 2})

df = pd.DataFrame.from_dict(c, orient='index')

#            0
# (2017, a)  1
# (2017, b)  2
# (2017, c)  2
# (2010, a)  1
# (2010, c)  2
# (2010, d)  1
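If separate columns are preferred over a (year, genre) tuple index, the Counter can be unpacked into rows first. This is a minimal sketch; the column names are taken from the question's expected output, `tidy` is an illustrative name, and the Counter is written out literally so the snippet runs on its own.

```python
from collections import Counter

import pandas as pd

# Same counts as produced by the Counter approach above.
c = Counter({(2017, 'a'): 1, (2017, 'b'): 2, (2017, 'c'): 2,
             (2010, 'a'): 1, (2010, 'c'): 2, (2010, 'd'): 1})

# Unpack each ((year, genre), count) item into one row with three columns.
tidy = pd.DataFrame(
    [(year, genre, n) for (year, genre), n in c.items()],
    columns=['year', 'genre', 'number_of_movies'],
)
print(tidy)
```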

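Have you tried anything?

Thanks for the answer. I tried the solution. However, my genres look like ['Adventure', 'Comedy', ...]. After the first step I get output like this: (2016, '['): 33, (2016, "'"): 206, (2016, 'A'): 40, (2016, 'd'): 28, (2016, 'v'): 20, (2016, 'e'): 83, (2016, 'n'): 70, (2016, 't'): 61, (2016, 'u'): 21, (2016, 'r'): 59, (2016, ','): 70, (2016, ' '): 78, .... However, the expected output is (2016, 'Adventure'): 40, ...

@Sitabja, I'm confused. With your input, I get your expected output (apart from using (year, genre) as the index)... It looks like the logic was misapplied, or your data is different.

It is working. There was actually a problem with my input, which I was able to fix. Thank you very much for the quick response.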
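The commenter did not say what was wrong with their input. One common cause of per-character counts like those, offered here only as an assumption, is a genre column that holds the string "['Adventure', 'Comedy']" rather than an actual list. The sketch below uses made-up sample data and parses such strings with `ast.literal_eval` before counting.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical data: each genre cell is a string that merely looks like a list.
raw = pd.DataFrame({
    'genre': ["['Adventure', 'Comedy']", "['Adventure']"],
    'year': [2016, 2016],
})

# Iterating over a string yields single characters, hence keys such as (2016, '[').
bad = Counter((y, ch) for y, g in zip(raw['year'], raw['genre']) for ch in g)

# Parse each string into a real Python list, then count whole genre names.
raw['genre'] = raw['genre'].apply(ast.literal_eval)
good = Counter((y, name) for y, g in zip(raw['year'], raw['genre']) for name in g)

print(good)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of cleanup.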