Can I classify the elements of df.column and create a column with the output, without iterating (Python-Np)?
Consider this DataFrame:
A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])
I want to classify its elements by the size of the value in column 'A' and create a new column with the output, like this:
In [26]: A['Size'] = ""
for index, row in A.iterrows():
    if row['A'] >= 4:
        A.loc[index, 'Size'] = 'Big'
    if 2.5 < row['A'] < 4:
        A.loc[index, 'Size'] = 'Medium'
    if 0 < row['A'] < 2.4:
        A.loc[index, 'Size'] = 'Small'
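Assuming there are many columns, with different thresholds for the same categories, is there a more efficient way?
Thanks

You can use loc with a boolean mask to assign only to the rows that meet each condition. Even for a df this small it is faster, and for larger dfs it will be significantly faster: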
In [60]:
%%timeit
A['Size'] = ""
for index, row in A.iterrows():
    if row['A'] >= 4:
        A.loc[index, 'Size'] = 'Big'
    if 2.5 < row['A'] < 4:
        A.loc[index, 'Size'] = 'Medium'
    if 0 < row['A'] < 2.4:
        A.loc[index, 'Size'] = 'Small'
100 loops, best of 3: 2.31 ms per loop
In [62]:
%%timeit
A.loc[A['A'] >=4, 'Size'] = 'Big'
A.loc[(A['A'] >= 2.5) & (A['A'] < 4), 'Size'] = 'Medium'
A.loc[A['A'] < 2.4, 'Size'] = 'Small'
100 loops, best of 3: 1.95 ms per loop
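The question also asks about many columns with different thresholds. One way to reuse the loc masks is to wrap them in a small function. The following is only a sketch: the helper name add_size_column, its parameters medium_from and big_from, and the thresholds used for column 'C' are hypothetical, and it uses a single cut-off between Small and Medium instead of the separate 2.4 and 2.5 bounds above.

```python
import pandas as pd


def add_size_column(df, source, target, medium_from, big_from):
    """Label df[source] as Small / Medium / Big and store it in df[target].

    medium_from and big_from are hypothetical threshold parameters:
    values below medium_from are 'Small', values from big_from up are 'Big'.
    """
    values = df[source]
    df[target] = ""  # default for rows that match no condition (e.g. NaN)
    df.loc[values < medium_from, target] = "Small"
    df.loc[(values >= medium_from) & (values < big_from), target] = "Medium"
    df.loc[values >= big_from, target] = "Big"
    return df


A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

# Same categories, different (assumed) thresholds per column.
add_size_column(A, 'A', 'Size_A', medium_from=2.5, big_from=4)
add_size_column(A, 'C', 'Size_C', medium_from=2, big_from=4)
print(A)
```

Each call still performs three vectorised assignments, so no row-by-row iteration is involved.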
Update
Interestingly, for a 50,000-row df the loc method outperforms the nested np.where method: I get 4.24 ms compared with 12.1 ms.
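To check this comparison on your own machine, something like the following could be used. It is a rough sketch under assumptions: the 50,000-row test data is random integers from 1 to 5 (the original data is not shown), the function names with_loc and with_where are made up for illustration, and each timing includes the cost of copying the frame. Actual numbers will depend on hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: 50,000 random integers between 1 and 5.
rng = np.random.default_rng(0)
big = pd.DataFrame({'A': rng.integers(1, 6, size=50_000)})


def with_loc(df):
    """Classify with three boolean-mask assignments."""
    df = df.copy()
    df['Size'] = ""
    df.loc[df['A'] >= 4, 'Size'] = 'Big'
    df.loc[(df['A'] >= 2.5) & (df['A'] < 4), 'Size'] = 'Medium'
    df.loc[df['A'] < 2.4, 'Size'] = 'Small'
    return df


def with_where(df):
    """Classify with nested np.where calls."""
    df = df.copy()
    df['Size'] = np.where(df['A'] < 2.4, 'Small',
                          np.where((df['A'] >= 2.5) & (df['A'] < 4), 'Medium',
                                   np.where(df['A'] >= 4, 'Big', '')))
    return df


# Best of 3 repeats, 10 calls each, reported per call.
loc_time = min(timeit.repeat(lambda: with_loc(big), number=10, repeat=3)) / 10
where_time = min(timeit.repeat(lambda: with_where(big), number=10, repeat=3)) / 10
print(f"loc: {loc_time * 1e3:.2f} ms, np.where: {where_time * 1e3:.2f} ms")
```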
For comparison, the nested np.where version on the small 5-row df:
In [64]:
%%timeit
A['Size'] = np.where(A['A'] < 2.4, 'Small',
                     np.where((A['A'] >= 2.5) & (A['A'] < 4), 'Medium',
                              np.where(A['A'] >= 4, 'Big', '')))
1000 loops, best of 3: 828 µs per loop
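Nested np.where calls get hard to read as the number of categories grows. A flatter alternative, not part of the answer above, is np.select, which takes a list of conditions and a matching list of labels. The sketch below reuses the same conditions, so values between 2.4 and 2.5 still fall through to the empty default.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 2], [5, 1, 4]],
                 columns=['A', 'B', 'C'], index=[1, 2, 3, 4, 5])

# Conditions are checked in order; the first one that is True wins.
conditions = [
    A['A'] < 2.4,
    (A['A'] >= 2.5) & (A['A'] < 4),
    A['A'] >= 4,
]
labels = ['Small', 'Medium', 'Big']

# Rows matching none of the conditions get the empty string.
A['Size'] = np.select(conditions, labels, default='')
print(A)
```

Adding a category then means appending one condition and one label instead of nesting another call.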