Can I classify the elements of a df.column and create a column with the output, without iterating (Python, pandas, NumPy)?


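Given this DataFrame:

import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

I want to classify the elements of column 'A' by size and create a new column that holds the result, like this: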

In [26]:

A['Size'] = ""
for index, row in A.iterrows():
    if row['A'] >= 4:
        A.loc[index, 'Size'] = 'Big'
    if 2.5 < row['A'] < 4:
        A.loc[index, 'Size'] = 'Medium'
    if 0 < row['A'] < 2.4:
        A.loc[index, 'Size'] = 'Small'
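Is there a more efficient way to do this, given that there are many columns to classify in the same way, each with different parameters?

Thanks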

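You can use loc with a boolean mask to assign only to the rows that satisfy each condition. This is faster even for such a small df, and significantly faster for a larger one: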

In [60]:

%%timeit 
A['Size'] = ""
for index, row in A.iterrows():
    if row['A'] >= 4:
        A.loc[index, 'Size'] = 'Big'
    if 2.5 < row['A'] < 4:
        A.loc[index, 'Size'] = 'Medium'
    if 0 < row['A'] < 2.4:
        A.loc[index, 'Size'] = 'Small'
100 loops, best of 3: 2.31 ms per loop
In [62]:

%%timeit
A.loc[A['A'] >=4, 'Size'] = 'Big'
A.loc[(A['A'] >= 2.5) & (A['A'] < 4), 'Size'] = 'Medium'
A.loc[A['A'] < 2.4, 'Size'] = 'Small'

100 loops, best of 3: 1.95 ms per loop
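
To address the "many columns, different parameters" part of the question, the same loc masking can be wrapped in a function. This is only a minimal sketch: the helper name classify_by_thresholds, the '_Size' column suffix, and the cut-offs for column 'B' are assumptions made for illustration, and the 2.4 to 2.5 gap in the original rules is closed by using a single medium_from boundary.

```python
import pandas as pd

def classify_by_thresholds(df, source, target, medium_from, big_from):
    """Hypothetical helper: label df[source] as Small/Medium/Big in df[target]."""
    values = df[source]
    df[target] = ""  # default for values that match no rule (e.g. <= 0)
    # Each assignment touches only the rows selected by its boolean mask.
    df.loc[(values > 0) & (values < medium_from), target] = "Small"
    df.loc[(values >= medium_from) & (values < big_from), target] = "Medium"
    df.loc[values >= big_from, target] = "Big"
    return df

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

# Assumed per-column cut-offs: (medium_from, big_from)
rules = {'A': (2.5, 4), 'B': (2, 5)}
for col, (medium_from, big_from) in rules.items():
    classify_by_thresholds(A, col, col + '_Size', medium_from, big_from)

print(A)
```

The loop here runs once per column, not once per row, so the row-wise work stays vectorized.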
Update

Interestingly, for a 50,000-row df the loc method outperforms the nested np.where method: I get 4.24 ms versus 12.1 ms.

In [64]:

%%timeit
A['Size'] = np.where(A['A'] < 2.4, 'Small',
                     np.where((A['A'] >= 2.5) & (A['A'] < 4), 'Medium',
                              np.where(A['A'] >= 4, 'Big', '')))
1000 loops, best of 3: 828 µs per loop
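
As a possible alternative to chaining masks or nesting np.where, pandas also provides pd.cut for binning a numeric column. The sketch below is not part of the timings above and makes its own assumptions: the bin edges [0, 2.5, 4, inf] are half-open on the right (right=False), so 2.5 counts as Medium and 4 as Big, the 2.4 to 2.5 gap is closed, and 0 itself would fall into Small. Values below 0 come out as NaN.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

# Assumed bins: [0, 2.5) -> Small, [2.5, 4) -> Medium, [4, inf) -> Big
A['Size'] = pd.cut(A['A'],
                   bins=[0, 2.5, 4, np.inf],
                   right=False,
                   labels=['Small', 'Medium', 'Big'])

print(A)
```

The result is a categorical column; call .astype(str) on it if plain strings are needed.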