How do I find the unique list items in a Python dataframe?
I have a dataset that contains movie titles along with the genres each movie belongs to. Each movie has more than one genre. For the whole dataset, I want to find the total number of unique genres that exist. I can't use df.unique(), because each entry in the genres column of the dataframe is itself a list of genres.
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
11 12 Dracula: Dead and Loving It (1995) Comedy|Horror
12 13 Balto (1995) Adventure|Animation|Children
13 14 Nixon (1995) Drama
14 15 Cutthroat Island (1995) Action|Adventure|Romance
15 16 Casino (1995) Crime|Drama
16 17 Sense and Sensibility (1995) Drama|Romance
17 18 Four Rooms (1995) Comedy
18 19 Ace Ventura: When Nature Calls (1995) Comedy
19 20 Money Train (1995) Action|Comedy|Crime|Drama|Thriller
20 21 Get Shorty (1995) Comedy|Crime|Thriller
21 22 Copycat (1995) Crime|Drama|Horror|Mystery|Thriller
22 23 Assassins (1995) Action|Crime|Thriller
23 24 Powder (1995) Drama|Sci-Fi
24 25 Leaving Las Vegas (1995) Drama|Romance
25 26 Othello (1995) Drama
26 27 Now and Then (1995) Children|Drama
27 28 Persuasion (1995) Drama|Romance
28 29 City of Lost Children, The (Cité des enfants p...
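This is the movies dataset.
Under the genres column, I want to split Action|Comedy|Crime|Drama|Thriller into Action, Comedy, Crime, Drama, Thriller.
Also, for the whole dataset, which is now a dataframe, I want to find the unique genres.
You can do it as follows: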
import pandas as pd

df = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
                   'genres': ['Adventure|Animation|Children|Comedy|Fantasy', 'Adventure|Children|Fantasy', 'Comedy|Romance']})
# Split every genres string on '|' and collect the pieces in a set to drop duplicates
a = list(set(y for x in df['genres'] for y in x.split('|')))
print(a)
Output (a set is unordered, so the order of the items may differ between runs):
['Animation', 'Comedy', 'Children', 'Fantasy', 'Adventure', 'Romance']
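The same result can also be reached with pandas string methods alone. The following is a minimal sketch, assuming pandas 0.25 or later (where Series.explode is available) and reusing the three sample rows from the answer above; the names all_genres and n_unique are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Split each pipe-delimited string into a list, then give every genre its own row.
all_genres = df['genres'].str.split('|').explode()

# Sort the unique values so the result has a stable order.
unique_genres = sorted(all_genres.unique())
# Number of distinct genres in the whole column.
n_unique = all_genres.nunique()

print(unique_genres)
print(n_unique)
```

Sorting is optional; it is used here only so that the printed list does not depend on the order in which genres first appear.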
Try the following approach:
import functools
import operator

# List of lists: one list of genres per movie
temp = df.genres.str.split("|").tolist()
# Flatten the list of lists, then build a set to keep only the unique genres
unique_genres = set(functools.reduce(operator.concat, temp))
# Number of unique genres
print(len(unique_genres))
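One caveat with reduce(operator.concat, ...) on lists: every step builds a new, longer list, so the flattening cost grows roughly quadratically with the number of rows. A possible alternative, sketched here with the three sample rows as stand-in data, is itertools.chain.from_iterable, which walks each inner list only once. The variable name genre_lists is only illustrative.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# One list of genres per movie.
genre_lists = df['genres'].str.split('|').tolist()

# chain.from_iterable yields the items of each inner list lazily,
# so no intermediate concatenated lists are created.
unique_genres = set(chain.from_iterable(genre_lists))

print(len(unique_genres))
```

On a small frame the difference is negligible; it should only start to matter once the dataset has many thousands of rows.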
Try the following:
import pandas as pd

df = pd.read_csv('movies.csv')
# Turn each pipe-delimited string into a list of genres
df['genres'] = df['genres'].apply(lambda x: x.strip().split('|'))
# Number of genres per movie
df['count'] = df['genres'].apply(len)
print(df)
Output:
movieId ... genres count
0 1 ... [Adventure, Animation, Children, Comedy, Fantasy] 5
1 2 ... [Adventure, Children, Fantasy] 3
2 3 ... [Comedy, Romance] 2
3 4 ... [Comedy, Drama, Romance] 3
4 5 ... [Comedy] 1
5 6 ... [Action, Crime, Thriller] 3
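Once the genres column holds lists, as in the output above, the unique genres and a per-genre movie count can be derived from it. This is a rough sketch that uses a small inline frame instead of movies.csv (so it runs without the file) and assumes pandas 0.25 or later for Series.explode; genre_counts is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Same transformation as above: strip whitespace, then split on '|'.
df['genres'] = df['genres'].str.strip().str.split('|')

# Union of all per-movie lists gives the distinct genres.
unique_genres = set().union(*df['genres'])

# One row per (movie, genre) pair, then count how many movies carry each genre.
genre_counts = df['genres'].explode().value_counts()

print(len(unique_genres))
print(genre_counts)
```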
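Have you tried collecting all the genre columns into one array first and then calling .unique()?
No, not yet. I'm very new to Python, so I'm not familiar with it. I'll give it a try.
I tried it, and it does work. It just takes a while to run. Thanks.
Glad it worked! In any case, AkshayNevrekar's answer seems better.
This works too. But the one from ashish14 seems to be faster. Thanks anyway!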