Python: get unique values from a column whose values differ, and split rows into multiple rows based on a condition

Tags: python, pandas

Here is a sample of the dataframe:

df_movies['genres'].unique()
array(['Action|Adventure|Science Fiction|Thriller',
       'Adventure|Science Fiction|Thriller',
       'Action|Adventure|Science Fiction|Fantasy', ...,
       'Adventure|Drama|Action|Family|Foreign',
       'Comedy|Family|Mystery|Romance',
       'Mystery|Science Fiction|Thriller|Drama'], dtype=object)
When I try:

df_movies[df_movies['genres'].str.contains('|')]
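This just returns every row, including those with only a single genre, such as "Horror" or "Documentary".

How can I get all the unique values from this column? And is there a way to split each row into multiple rows, so that each row has only one genre associated with it?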

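| is a special character. By default, str.contains treats its pattern as a regular expression, in which | joins alternative patterns. For example, Series.str.contains('foo|seven') is the same as asking, for each row's value (call it x), whether 'foo' in x or 'seven' in x.

Given that, your query is interpreted as '' in x or '' in x, which is True for every row. To match the character '|' literally, escape it with a backslash: '\|'.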

import pandas as pd

df = pd.DataFrame({'genres': ['foo|bar', 'no_bar_here']})

# Raw string, so the backslash reaches the regex engine unchanged
df['genres'].str.contains(r'\|')
0     True
1    False
Name: genres, dtype: bool
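As a complement to the escaped pattern above, here is a minimal sketch of two other ways to match a literal pipe, next to the unescaped call for comparison. It reuses the answer's toy dataframe; the variable names mask_literal, mask_escaped and mask_unescaped are illustrative only and not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame({'genres': ['foo|bar', 'no_bar_here']})

# Option 1: switch regex matching off, so '|' is treated as a plain substring
mask_literal = df['genres'].str.contains('|', regex=False)

# Option 2: let re.escape build the escaped pattern instead of typing it by hand
mask_escaped = df['genres'].str.contains(re.escape('|'))

# For comparison: an unescaped '|' is an alternation of two empty patterns,
# which matches every string
mask_unescaped = df['genres'].str.contains('|')

print(mask_literal.tolist())
print(mask_escaped.tolist())
print(mask_unescaped.tolist())
```

Passing regex=False is probably the least error-prone choice when the search text is a fixed string rather than a pattern.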

This should work. I added a "movies" column because I assume your data contains other information associated with the genres.

# Recreate data
movies = ['movie_1',
          'movie_2',
          'movie_3',
          'movie_4',
          'movie_5',
          'movie_6']

genres = ['Action|Adventure|Science Fiction|Thriller',
          'Adventure|Science Fiction|Thriller',
          'Action|Adventure|Science Fiction|Fantasy',
          'Adventure|Drama|Action|Family|Foreign',
          'Comedy|Family|Mystery|Romance',
          'Mystery|Science Fiction|Thriller|Drama']

import pandas as pd

# Initialize empty dataframe
df = pd.DataFrame()

# Create dataframe from data
df['movies'] = movies
df['genres'] = genres
df['genres'] = df['genres'].astype(str)

# Check to make sure data came in right
print(df.dtypes)
print(df.head())

# Create Regex to split genres
regex = r"\|"

# Split genres to new column and store values as a list
df['genres'] = df['genres'].str.split(regex)

# Create new dataframe with each genre from each list on a separate row;
# explode() repeats the other columns (here 'movies') for every new row
df_final = df.explode('genres')

# Get unique genres
unique_genres = df_final['genres'].unique()

# Print results
print(df_final.head())
print(unique_genres)
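For readers who prefer a single chained expression, the same split-then-explode idea can be sketched more compactly. This is an illustrative variant, not the answer's own code: the two-row dataframe and the name df_long are made up for the example, and it assumes a pandas version that provides DataFrame.explode.

```python
import pandas as pd

# Small made-up sample, including a row that has only one genre
df = pd.DataFrame({
    'movies': ['movie_1', 'movie_2'],
    'genres': ['Action|Adventure', 'Comedy'],
})

# A single-character separator is treated as a literal string by str.split,
# so '|' needs no escaping here. explode() then gives every list element
# its own row and repeats the 'movies' value alongside it.
df_long = (
    df.assign(genres=df['genres'].str.split('|'))
      .explode('genres')
      .reset_index(drop=True)
)

unique_genres = df_long['genres'].unique()

print(df_long)
print(unique_genres)
```

Rows with a single genre pass through unchanged, and reset_index(drop=True) replaces the repeated index labels that explode() leaves behind.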

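Thanks, that makes sense. Now, how do I get a list of all the unique values?

@sumak In that case you can use .str.split to put each item into its own cell. You don't need to escape the character here, so you can do:

df_movies['genres'].str.split('|', expand=True).stack().unique()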
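The one-liner from the comment can be tried on a small made-up dataframe, alongside an explode-based route that should give the same set of values. The sample data and the names via_stack and via_explode are assumptions for illustration. The explicit dropna() is an added precaution, on the assumption that stack() might keep the padding values from expand=True in some pandas versions.

```python
import pandas as pd

# Made-up sample: rows with different numbers of genres
df_movies = pd.DataFrame({'genres': ['Action|Adventure', 'Adventure|Drama', 'Horror']})

# expand=True spreads the pieces over columns and pads short rows with
# missing values; stack() folds the columns back into a single Series.
via_stack = (
    df_movies['genres'].str.split('|', expand=True)
    .stack()
    .dropna()  # precaution: discard any padding that survives stack()
    .unique()
)

# Alternative: keep the pieces as lists in one column and explode them into rows
via_explode = df_movies['genres'].str.split('|').explode().unique()

print(sorted(via_stack))
print(sorted(via_explode))
```

Both routes return the values in order of first appearance, so the sketch sorts them before printing to make the comparison easier.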