Python: get the unique values from a column whose cells each hold several values, and split rows into multiple rows based on a condition
Here is a sample of the dataframe:
df_movies['genres'].unique()
array(['Action|Adventure|Science Fiction|Thriller',
'Adventure|Science Fiction|Thriller',
'Action|Adventure|Science Fiction|Fantasy', ...,
'Adventure|Drama|Action|Family|Foreign',
'Comedy|Family|Mystery|Romance',
'Mystery|Science Fiction|Thriller|Drama'], dtype=object)
When I try
df_movies[df_movies['genres'].str.contains('|')]
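this just lists all the rows, including those with only a single genre, such as "Horror", "Documentary", etc.
How can I get all the unique values from this column? Also, is there a way to break each row into multiple rows, so that each row has only one genre associated with it?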
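| is a special character. With contains, the pattern is treated as a regular expression by default, and | is used to join multiple alternatives. For example, Series.str.contains('foo|seven') is the same as asking, for each row's value (call it x): 'foo' in x or 'seven' in x.
Given that, your query is interpreted as '' in x or '' in x, which is True for every row. To use the character | literally, you need to escape it with a backslash (\):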
import pandas as pd
df = pd.DataFrame({'genres': ['foo|bar', 'no_bar_here']})
# Raw string, so the backslash reaches the regex engine unchanged
df['genres'].str.contains(r'\|')
0 True
1 False
Name: genres, dtype: bool
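As a complement to the escaping approach above, here is a minimal sketch of two other ways to match the pipe literally: passing regex=False to str.contains, or escaping the pattern with re.escape. The variable names (literal_mask, escaped_mask, unescaped_mask) are made up for illustration, and the sample data is the same two-row frame used above.

```python
import re

import pandas as pd

df = pd.DataFrame({'genres': ['foo|bar', 'no_bar_here']})

# Option 1: treat the pattern as a plain substring rather than a regex
literal_mask = df['genres'].str.contains('|', regex=False)

# Option 2: stay in regex mode, but escape the pattern programmatically
escaped_mask = df['genres'].str.contains(re.escape('|'))

# For comparison: the unescaped pattern matches the empty string,
# so every row is reported as a match
unescaped_mask = df['genres'].str.contains('|')

print(literal_mask.tolist())
print(escaped_mask.tolist())
print(unescaped_mask.tolist())
```

regex=False is probably the simpler choice when the separator is a fixed string, while re.escape helps when the separator comes from a variable and the rest of the pattern still needs to be a regex.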
This should do the trick. I added a "movies" column because I assume your data has other information associated with the genres.
# Recreate data
movies = ['movie_1',
'movie_2',
'movie_3',
'movie_4',
'movie_5',
'movie_6']
genres = ['Action|Adventure|Science Fiction|Thriller',
'Adventure|Science Fiction|Thriller',
'Action|Adventure|Science Fiction|Fantasy',
'Adventure|Drama|Action|Family|Foreign',
'Comedy|Family|Mystery|Romance',
'Mystery|Science Fiction|Thriller|Drama']
import pandas as pd
# Initialize empty dataframe
df = pd.DataFrame()
# Create dataframe from data
df['movies'] = movies
df['genres'] = genres
df['genres'] = df['genres'].astype(str)
# Check to make sure data came in right
print(df.dtypes)
print(df.head())
# Regex matching a literal pipe (escaped, because | means "or" in a regex)
regex = r"\|"
# Split each genres string into a list, overwriting the column
df['genres'] = df['genres'].str.split(regex)
# Create new dataframe with each genre from each list on a separate row
df_final = df.explode('genres')
# explode() already carries the other columns (here 'movies') along,
# so there is no need to join back to df and drop duplicate columns
# Get unique genres
unique_genres = df_final['genres'].unique()
# Print results
print(df_final.head())
print(unique_genres)
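To connect this back to the question's concern about rows that have only one genre, here is a small sketch of the same split-then-explode idea on a toy frame. The movie titles and the name long_df are hypothetical; only str.split, assign, explode and reset_index are used.

```python
import pandas as pd

df = pd.DataFrame({
    'movies': ['movie_a', 'movie_b'],          # hypothetical titles
    'genres': ['Action|Adventure', 'Horror'],  # second row has a single genre
})

# A one-character separator is treated literally by str.split,
# so '|' needs no escaping here
long_df = (
    df.assign(genres=df['genres'].str.split('|'))
      .explode('genres')          # one row per (movie, genre) pair
      .reset_index(drop=True)     # replace the repeated index labels
)

print(long_df)
```

The single-genre row passes through as one row, while the two-genre row becomes two rows that repeat the movie title.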
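Thanks, this makes sense. Now, how do I get a list of all the unique values?
@sumak In that case you can use .str.split to put each item into its own cell. Here you don't need to escape the character, so you can do df_movies['genres'].str.split('|', expand=True).stack().unique()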
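A possible variation on the one-liner in the comment, sketched under the assumption that a flat list of genres is all that is needed: split each cell into a list and explode the resulting Series instead of expanding into columns. The three sample rows and the name unique_genres are invented for illustration.

```python
import pandas as pd

df_movies = pd.DataFrame({
    'genres': ['Action|Adventure', 'Adventure|Thriller', 'Horror'],
})

# Split each cell into a list, flatten to one genre per row,
# then keep the distinct values in order of first appearance
unique_genres = df_movies['genres'].str.split('|').explode().unique()

print(list(unique_genres))
```

Because explode works on the lists directly, no padding cells are created for rows that have fewer genres than the longest row.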