Python: Conditional creation of a Series/DataFrame column

python pandas numpy dataframe

I have a dataframe along the lines of the below:

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

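I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (equal number of records/rows) which sets a colour 'green' if Set == 'Z' and 'red' if Set equals anything else.

What's the best way to do this?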

If you only have two choices to select from:

df['color'] = np.where(df['Set']=='Z', 'green', 'red')

For example,

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print(df)

yields

  Set Type  color
0   Z    A  green
1   Z    B  green
2   X    B    red
3   Y    C    red
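Either branch of np.where can also be an array or a Series rather than a fixed string, which helps when the fallback value should come from another column. A minimal sketch, assuming you want rows that are not 'Z' to take their value from the Type column (the color_type column name is just illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# The third argument may be array-like instead of a scalar:
# rows where Set != 'Z' fall back to the value in the Type column.
df['color_type'] = np.where(df['Set'] == 'Z', 'green', df['Type'])
print(df)
```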

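If you have more than two conditions, then use np.select. For example, if you want color to be

yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
otherwise purple when (df['Type'] == 'B')
otherwise black,

then use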
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
conditions = [
    (df['Set'] == 'Z') & (df['Type'] == 'A'),
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
    (df['Type'] == 'B')]
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')
print(df)

which yields
  Set Type   color
0   Z    A  yellow
1   Z    B    blue
2   X    B  purple
3   Y    C   black
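One thing worth keeping in mind with np.select is that the first matching condition wins, so the order of the condition list matters. A small sketch of that behaviour, using a cut-down version of the conditions above (the variable names broad_first and narrow_first are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# Broad condition listed first: it captures every Type == 'B' row,
# so the narrower 'blue' condition after it never gets to match.
conditions = [
    df['Type'] == 'B',
    (df['Set'] == 'Z') & (df['Type'] == 'B'),
]
choices = ['purple', 'blue']
broad_first = np.select(conditions, choices, default='black')

# Same conditions in reverse order: the narrow one is now checked first.
narrow_first = np.select(conditions[::-1], choices[::-1], default='black')

print(list(broad_first))
print(list(narrow_first))
```

So list the most specific conditions first and the catch-all ones last.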

Another way in which this could be achieved is
df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

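List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.
Example list comprehension: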
df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
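If the condition needs values from more than one column, one option is to zip the columns together and unpack them in the comprehension. This is only a sketch; the particular rule used here (red only when Set is 'Z' and Type is 'B') is an assumed example, not part of the answer above:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

# zip() walks the two columns in lockstep, so each iteration sees
# the Set and Type values of the same row as plain Python objects.
df['color'] = [
    'red' if (s == 'Z' and t == 'B') else 'green'
    for s, t in zip(df['Set'], df['Type'])
]
print(df)
```

Note the use of `and` rather than `&` here, because s and t are scalars, not Series.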

%timeit tests:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
%timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
%timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
%timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')

1000 loops, best of 3: 239 µs per loop
1000 loops, best of 3: 523 µs per loop
1000 loops, best of 3: 263 µs per loop
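The timings above are on a four-row frame, so they mostly measure fixed overhead. If you want to repeat the comparison outside IPython on something bigger, a rough sketch with the standard timeit module could look like this (the 40,000-row size and the function names are arbitrary choices; results will vary with machine and library versions):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC') * 10_000, 'Set': list('ZZXY') * 10_000})

def with_listcomp():
    return ['green' if x == 'Z' else 'red' for x in df['Set']]

def with_where():
    return np.where(df['Set'] == 'Z', 'green', 'red')

def with_map():
    return df['Set'].map(lambda x: 'green' if x == 'Z' else 'red')

for fn in (with_listcomp, with_where, with_map):
    # best of 3 repeats, 10 calls per repeat
    seconds = min(timeit.repeat(fn, number=10, repeat=3))
    print(f'{fn.__name__}: {seconds / 10 * 1e3:.2f} ms per call')
```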

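The following is slower than the approaches timed above, but we can compute the extra column based on the contents of more than one column, and more than two values can be computed for the extra column.
Simple example using just the "Set" column: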
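A minimal sketch of what the code for this simple case could look like, assuming it mirrors the set_color function used in the multi-column example that follows:

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})

def set_color(row):
    # With axis=1, apply() passes each row to this function as a Series.
    return "red" if row["Set"] == "Z" else "green"

df = df.assign(color=df.apply(set_color, axis=1))
print(df)
```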
  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Example with more colours and more columns taken into account:
def set_color(row):
    if row["Set"] == "Z":
        return "red"
    elif row["Type"] == "C":
        return "blue"
    else:
        return "green"

df = df.assign(color=df.apply(set_color, axis=1))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C   blue

Edit (21/06/2019): Using plydata
It is also possible to use plydata to do this kind of thing (this seems even slower than using assign and apply, though).
from plydata import define, if_else

Simple if_else:
df = define(df, color=if_else('Set=="Z"', '"red"', '"green"'))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B  green
3   Y    C  green

Nested if_else:
df = define(df, color=if_else(
    'Set=="Z"',
    '"red"',
    if_else('Type=="C"', '"green"', '"blue"')))

print(df)

  Set Type  color
0   Z    A    red
1   Z    B    red
2   X    B   blue
3   Y    C  green

Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list:
def map_values(row, values_dict):
    return values_dict[row]

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

df = pd.DataFrame({'INDICATOR': ['A', 'B', 'C', 'D'], 'VALUE': [10, 9, 8, 7]})

df['NEW_VALUE'] = df['INDICATOR'].apply(map_values, args = (values_dict,))

Here is what it looks like:
df
Out[2]: 
  INDICATOR  VALUE  NEW_VALUE
0         A     10          1
1         B      9          2
2         C      8          3
3         D      7          4

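This approach can be very powerful when you have many if/else-type statements to make (i.e. many unique values to replace).
And of course you could always do this: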
df['NEW_VALUE'] = df['INDICATOR'].map(values_dict)

But that approach is more than three times as slow as the apply approach above, on my machine.
And you could also do this, using dict.get:
df['NEW_VALUE'] = [values_dict.get(v, None) for v in df['INDICATOR']]
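The three variants differ in how they treat a value that is missing from the dictionary: the direct lookup inside map_values raises a KeyError, .map() produces NaN, and dict.get returns whatever default you pass. A short sketch of the last two, assuming a hypothetical indicator 'E' that has no entry and -1 as an arbitrary sentinel:

```python
import pandas as pd

values_dict = {'A': 1, 'B': 2, 'C': 3, 'D': 4}

# 'E' is a made-up indicator with no entry in values_dict.
df = pd.DataFrame({'INDICATOR': ['A', 'B', 'E'], 'VALUE': [10, 9, 8]})

# .map() leaves unmatched keys as NaN, which can then be filled.
df['NEW_VALUE'] = df['INDICATOR'].map(values_dict).fillna(-1).astype(int)

# dict.get with an explicit default gives the same result in a comprehension.
df['NEW_VALUE_GET'] = [values_dict.get(v, -1) for v in df['INDICATOR']]

print(df)
```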

You can simply use the powerful .loc method with one condition or several, depending on your need (tested with pandas=1.0.5).
Code summary:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))
df['Color'] = "red"
df.loc[(df['Set']=="Z"), 'Color'] = "green"

#practice!
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"


Explanation:
df=pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

# df so far: 
  Type Set  
0    A   Z 
1    B   Z 
2    B   X 
3    C   Y

Add a 'Color' column and set all values to "red":
df['Color'] = "red"

Apply your single condition:
df.loc[(df['Set']=="Z"), 'Color'] = "green"


# df: 
  Type Set  Color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

or multiple conditions if you want:
df.loc[(df['Set']=="Z")&(df['Type']=="B")|(df['Type']=="C"), 'Color'] = "purple"
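When combining conditions like this, remember that & binds more tightly than | in Python, so the expression above selects (Set is Z and Type is B) or (Type is C). A small sketch that makes the grouping explicit (the mask names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(dict(Type='A B B C'.split(), Set='Z Z X Y'.split()))

is_z = df['Set'] == 'Z'
is_b = df['Type'] == 'B'
is_c = df['Type'] == 'C'

implicit = is_z & is_b | is_c       # parsed as (is_z & is_b) | is_c
explicit = (is_z & is_b) | is_c     # same selection, grouping spelled out
regrouped = is_z & (is_b | is_c)    # a different selection

print(implicit.tolist())
print(regrouped.tolist())
```

Adding the extra parentheses costs nothing and avoids surprises when someone edits the condition later.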

For more detail, read up on pandas logical operators and conditional selection.

A one-liner with the .apply() method is the following:
df['color'] = df['Set'].apply(lambda set_: 'green' if set_=='Z' else 'red')

After that, the df data frame looks like this:
>>> print(df)
  Type Set  color
0    A   Z  green
1    B   Z  green
2    B   X    red
3    C   Y    red

If you're working with massive data, a memoized approach would be best:
# First create a dictionary of manually stored values
color_dict = {'Z':'red'}

# Second, build a dictionary of "other" values
color_dict_other = {x:'green' for x in df['Set'].unique() if x not in color_dict.keys()}

# Next, merge the two
color_dict.update(color_dict_other)

# Finally, map it to your column
df['color'] = df['Set'].map(color_dict)

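This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when data_size > 10**4 and n_distinct < data_size/4.

E.g. memoize in a case of 10,000 rows with 2,500 or fewer distinct values.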
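The rule of thumb can be wrapped in a small helper. This is only a sketch: it assumes the thresholds read as data_size > 10**4 and n_distinct < data_size/4, and the function names (should_memoize, color_for, map_colors) are hypothetical:

```python
import pandas as pd

def should_memoize(series, min_rows=10**4, max_distinct_ratio=0.25):
    # Assumed reading of the rule: many rows, relatively few distinct values.
    data_size = len(series)
    n_distinct = series.nunique()
    return data_size > min_rows and n_distinct < data_size * max_distinct_ratio

def color_for(value):
    return 'red' if value == 'Z' else 'green'

def map_colors(series):
    if should_memoize(series):
        # Compute each distinct value once, then look results up by key.
        lookup = {value: color_for(value) for value in series.unique()}
        return series.map(lookup)
    # Otherwise call the function once per row.
    return series.map(color_for)

small = pd.Series(list('ZZXY'))
big = pd.Series(list('ZZXY') * 5_000)
print(should_memoize(small), should_memoize(big))
```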
You can also do this with pandas' own methods, in either of two ways.
Output:
  Type Set  color
1    A   Z  green
2    B   Z  green
3    B   X    red
4    C   Y    red
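One pair of pandas methods that fits this description is Series.where and Series.mask; that choice is an assumption, since the answer's own code is not shown. A sketch of both, using the 1-based index from the output above (the color_mask column exists only to show the second variant side by side):

```python
import pandas as pd

df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')}, index=[1, 2, 3, 4])

# Series.where keeps values where the condition is True and replaces the rest.
df['color'] = 'green'
df['color'] = df['color'].where(df['Set'] == 'Z', other='red')

# Series.mask is the inverse: it replaces values where the condition is True.
df['color_mask'] = 'red'
df['color_mask'] = df['color_mask'].mask(df['Set'] == 'Z', other='green')

print(df)
```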

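Comments:

Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).

Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]

Add iterrows to the dataframe, then you can access multiple columns via row: ['red' if (row['Set'] == 'Z') & (row['Type'] == 'B') else 'green' for index, row in df.iterrows()]

Note this nice solution will not work if you need to take replacement values from another series in the data frame, such as df['color_type'] = np.where(df['Set']=='Z', 'green', df['Type'])

@cheekybastart Or don't, since .iterrows() is notoriously sluggish and the DataFrame shouldn't be modified while iterating.

I like this answer because it shows how to do multiple replacements of values.

"But that approach is more than three times as slow as the apply approach from above, on my machine." How did you benchmark these? From my quick measurements, the .map() solution is about 10 times faster than .apply().

Update: on 100,000,000 rows with 52 string values, .apply() takes 47 seconds, versus only 5.91 seconds for .map().

Okay, so with only 2 distinct values to map and 100,000,000 rows, it takes 6.67 seconds to run without "memoization", and 9.86 seconds with. With 100,000,000 rows and 52 distinct values, where 1 of those maps to the first output value and the other 51 all correspond to the other: 7.99 seconds without memoization, 11.1 seconds with.

Are your values in random order? Or are they back to back? The high speed of pandas could be due to caching. @AMC

"Are your values in random order? Or are they back to back?" The values are random, selected using random.choices().

How can we reference other rows with this type of function? E.g. if row["Set"].shift(1) == "Z":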