Python: get the unique values from a column whose cells hold several values, and split rows into multiple rows based on a condition

python pandas

Here is a sample of the dataframe:

df_movies['genres'].unique()
array(['Action|Adventure|Science Fiction|Thriller',
       'Adventure|Science Fiction|Thriller',
       'Action|Adventure|Science Fiction|Fantasy', ...,
       'Adventure|Drama|Action|Family|Foreign',
       'Comedy|Family|Mystery|Romance',
       'Mystery|Science Fiction|Thriller|Drama'], dtype=object)

When I try

df_movies[df_movies['genres'].str.contains('|')]

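this just lists all rows, including those with only a single genre, such as "Horror", "Documentary", etc.

How can I get all the unique values from this column? Also, is there a way to break each row into multiple rows, so that each row has exactly one genre associated with it?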

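`|` is a special character. In the pattern passed to `contains` it joins multiple alternatives, because the pattern is treated as a regular expression by default. For example, `Series.str.contains('foo|seven')` is the same as asking, for each row's value (call it `x`): `'foo' in x or 'seven' in x`.
Given this, your query is interpreted as `'' in x or '' in x`, which is True for every row. To use the character `|` literally, you need to escape it with `\`: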

import pandas as pd

df = pd.DataFrame({'genres': ['foo|bar', 'no_bar_here']})

# Raw string, so the backslash reaches the regex engine unchanged
df['genres'].str.contains(r'\|')
0     True
1    False
Name: genres, dtype: bool
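
As a small sketch of the point above, the following compares the unescaped pattern with three ways of matching a literal pipe. The variable names are made up for illustration; `regex=False` and `re.escape` are alternatives to the backslash escape, not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame({'genres': ['foo|bar', 'no_bar_here']})

# Unescaped: the regex '|' means "empty string OR empty string",
# and an empty string is found in every value.
matches_everything = df['genres'].str.contains('|')

# Three ways to look for a literal pipe character instead.
escaped = df['genres'].str.contains(r'\|')              # escaped inside a regex
literal = df['genres'].str.contains('|', regex=False)   # plain substring search
via_re = df['genres'].str.contains(re.escape('|'))      # let re build the escape

print(matches_everything.tolist())
print(escaped.tolist())
print(literal.tolist())
print(via_re.tolist())
```

If the separator is always a fixed string, `regex=False` is probably the least error-prone choice, since nothing needs escaping.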

This should work. I added a "movies" column because I assume your data has other information associated with the genres.
# Recreate data
movies = ['movie_1',
          'movie_2',
          'movie_3',
          'movie_4',
          'movie_5',
          'movie_6']

genres = ['Action|Adventure|Science Fiction|Thriller',
          'Adventure|Science Fiction|Thriller',
          'Action|Adventure|Science Fiction|Fantasy',
          'Adventure|Drama|Action|Family|Foreign',
          'Comedy|Family|Mystery|Romance',
          'Mystery|Science Fiction|Thriller|Drama']

import pandas as pd

# Initialize empty dataframe
df = pd.DataFrame()

# Create dataframe from data
df['movies'] = movies
df['genres'] = genres
df['genres'] = df['genres'].astype(str)

# Check to make sure data came in right
print(df.dtypes)
print(df.head())

# Regex that matches a literal pipe character
regex = r"\|"

# Split each genres string into a list of genres (overwrites the column)
df['genres'] = df['genres'].str.split(regex)

# Create new dataframe with each genre from each list on a separate row
df_final = df.explode('genres')

# explode() already repeats the other columns (here 'movies') on every new row,
# so there is no need to join df_final back to df

# Get unique genres
unique_genres = df_final['genres'].unique()

# Print results
print(df_final.head())
print(unique_genres)
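
The same steps can be written as one chain. This is only a sketch on made-up sample data (the three-row frame and the names `df_long` and `single_genre` are assumptions for illustration); it also shows one way to pick out the rows that have a single genre, which is what the original `contains('|')` attempt was after.

```python
import pandas as pd

# Hypothetical sample data, smaller than the frame in the question
df = pd.DataFrame({
    'movies': ['movie_1', 'movie_2', 'movie_3'],
    'genres': ['Action|Adventure', 'Comedy', 'Adventure|Drama|Action'],
})

# Split on the literal pipe, then give each genre its own row.
# A one-character separator is treated as a plain string, so no escaping is needed.
df_long = (
    df.assign(genres=df['genres'].str.split('|'))
      .explode('genres')
      .reset_index(drop=True)
)

# Unique genres, in order of first appearance
unique_genres = df_long['genres'].unique()

# Rows of the original frame that contain no pipe at all, i.e. a single genre
single_genre = df[~df['genres'].str.contains('|', regex=False)]

print(df_long)
print(unique_genres)
print(single_genre)
```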

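Thanks! That makes sense. Now, how can I get a list of all the unique values?
@sumak In that case you can use .str.split to put each item into its own cell. Here you don't need to escape the character, so you can do df_movies['genres'].str.split('|', expand=True).stack().unique()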
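
To make the split-and-stack suggestion from the comment concrete, here is a minimal sketch on a made-up Series (the data and the names `wide` and `unique_genres` are assumptions). The explicit `dropna()` is an addition to the comment's one-liner, so the result does not depend on how `stack()` handles the padding cells.

```python
import pandas as pd

# Hypothetical sample column
genres = pd.Series(['Action|Adventure', 'Comedy', 'Adventure|Drama|Action'])

# expand=True gives one column per position; shorter rows are padded with missing values
wide = genres.str.split('|', expand=True)

# Stack the columns into one long Series, drop the padding, keep distinct values
unique_genres = wide.stack().dropna().unique()

print(wide)
print(unique_genres)
```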