Python: conditionally create an "other" category in a categorical column

python python-2.7 pandas dataframe


I have a DataFrame df with one column, category, created with the code below:

import pandas as pd
import random as rand
from string import ascii_uppercase

rand.seed(1010)

df = pd.DataFrame()
values = list()
for i in range(0,1000):   
    category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
    values.append(category)

df['category'] = values

The frequency count of each value is:

df['category'].value_counts()
Out[95]: 
P    54
B    50
T    48
V    46
I    46
R    45
F    43
K    43
U    41
C    40
W    39
E    39
J    39
X    37
M    37
Q    35
Y    35
Z    34
O    33
D    33
H    32
G    32
L    31
N    31
S    29

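I want to create a new value called "other" in the df['category'] column and assign it to every value in df['category'] whose count is less than 35.

Can anyone help me with this? Let me know if you need anything more from me.

Edit, following the solution suggested by @EdChum: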
import pandas as pd
import random as rand
from string import ascii_uppercase

rand.seed(1010)

df = pd.DataFrame()
values = list()
for i in range(0,1000):   
    category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
    values.append(category)

df['category'] = values
df['category'].value_counts()

df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_‌counts() < 35]).index), 'category'] = 'other'

  File "<stdin>", line 1
    df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_‌counts() < 35]).index), 'category'] = 'other'
                                                                                   ^
SyntaxError: invalid syntax


Please note that I am using Python 2.7 in the Spyder IDE (I tried the suggested solution in both the IPython and Python console windows).
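
A likely cause of the SyntaxError above, offered as an assumption rather than a confirmed diagnosis: code copied from a web comment can carry invisible Unicode characters such as a zero-width non-joiner (U+200C), and the caret in the traceback points into value_counts, where such a character would sit. The sketch below uses a hypothetical helper, find_non_ascii, and a made-up pasted fragment to show how to locate and strip such characters.

```python
import unicodedata


def find_non_ascii(text):
    """Return (position, code point, Unicode name) for each non-ASCII character."""
    return [(pos, 'U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN'))
            for pos, ch in enumerate(text) if ord(ch) > 127]


# Hypothetical pasted fragment; the invisible character is written as an
# escape sequence here so that it is visible in the source.
pasted = u"df['category'].value_\u200ccounts()"

# Report every character that is not plain ASCII.
print(find_non_ascii(pasted))

# Drop the non-ASCII characters to recover code that Python can parse.
cleaned = u''.join(ch for ch in pasted if ord(ch) < 128)
print(cleaned)
```

Retyping the line by hand instead of pasting it has the same effect as the cleaning step.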
You can use value_counts to generate a boolean mask that selects the values, then use loc to set those values to 'other':
In [71]:
df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
df

Out[71]:
    category
0      other
1      other
2          A
3          V
4          U
5          D
6          T
7          G
8          S
9          H
10     other
11     other
12     other
13     other
14         S
15         D
16         B
17         P
18         B
19     other
20     other
21         F
22         H
23         G
24         P
25     other
26         M
27         V
28         T
29         A
..       ...
970        E
971        D
972    other
973        P
974        V
975        S
976        E
977    other
978        H
979        V
980        O
981    other
982        O
983        Z
984    other
985        P
986        P
987    other
988        O
989    other
990        P
991        X
992        E
993        V
994        B
995        P
996        B
997        P
998        Q
999        X

[1000 rows x 1 columns]


Breaking the above down:
In [74]:
df['category'].value_counts() < 35

Out[74]:
W    False
B    False
C    False
V    False
H    False
P    False
T    False
R    False
U    False
K    False
E    False
Y    False
M    False
F    False
O    False
A    False
D    False
Q    False
N     True
J     True
S     True
G     True
Z     True
I     True
X     True
L     True
Name: category, dtype: bool

In [76]:    
df['category'].value_counts()[df['category'].value_counts() < 35]

Out[76]:
N    34
J    33
S    33
G    33
Z    32
I    31
X    31
L    30
Name: category, dtype: int64


We can then use isin against the .index values and set the matching rows to 'other'.
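
The same steps can be written so that value_counts() runs only once and each intermediate result has a name. This is a minimal sketch on a small made-up frame, not the data from the question; the threshold of 3 is chosen only to suit the toy data (the question uses 35).

```python
import pandas as pd

# Small illustrative frame: A appears 3 times, B twice, C once.
df = pd.DataFrame({'category': list('AAABBC')})

# Count each label once and reuse the result.
counts = df['category'].value_counts()

# Labels that occur fewer than 3 times.
rare = counts[counts < 3].index

# Overwrite the rows whose label is in the rare set.
df.loc[df['category'].isin(rare), 'category'] = 'other'

print(df['category'].tolist())
# ['A', 'A', 'A', 'other', 'other', 'other']
```

Storing the counts in a variable also makes the long one-liner easier to type by hand, which avoids copy-and-paste problems.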

Here is a similar example:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
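
The top-N variant above can also be wrapped in a small function that returns a new Series instead of modifying the frame in place. This is only a sketch: keep_top_n is a hypothetical helper name, the neighborhood labels are made up, and Series.where is used here in place of the loc assignment.

```python
import pandas as pd


def keep_top_n(series, n, other='OTHER'):
    """Return a copy of `series` where labels outside the n most frequent become `other`."""
    # value_counts() sorts by frequency, so the first n index entries are the top n labels.
    top = series.value_counts().index[:n]
    # where() keeps values where the condition holds and substitutes `other` elsewhere.
    return series.where(series.isin(top), other)


# Made-up data: North appears 3 times, South twice, East once.
df = pd.DataFrame({'NEIGHBORHOOD': ['North', 'North', 'North', 'South', 'South', 'East']})
df['NEIGHBORHOOD'] = keep_top_n(df['NEIGHBORHOOD'], n=2)

print(df['NEIGHBORHOOD'].tolist())
# ['North', 'North', 'North', 'South', 'South', 'OTHER']
```

When several labels tie at the cut-off, which of them counts as "top n" depends on the order value_counts() returns, so a count threshold may be the more predictable choice in that case.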