Python pandas: a column that depends on another column's value

I have a pandas DataFrame that looks like this:

   col1  col2  col3  col4
0     5     1    11     9
1     2     3    14     7
2     6     5    54     8
3    11     2    67    44
4    23     8     2    23
5     1     5     9     8
6     9     7    45    71
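
I want to create a fifth column (col5) that depends on the value of col1 and takes its value from one of the other columns.

This is the logic I have in mind, but I am running into problems with it: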

if col1 < 3:
   col5 == col2
elif col1 < 7 & col1 >= 3:
   col5 == col3
elif col1 >= 7 & col1 < 50:
   col5 == col4
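
Thanks in advance, and let me know if you have any questions.

You can use nested numpy.where calls. The final value 1 is used when no condition is True (col1 >= 50). In the output below, row 5 of col1 holds 97 rather than 1 so that this default is exercised: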

df['col5'] = np.where(df['col1'] <3, df['col2'], 
             np.where((df['col1'] <7) & (df['col1'] >=3 ), df['col3'], 
             np.where((df['col1'] >=7) & (df['col1'] <50 ), df['col4'], 1))) 
print (df)
   col1  col2  col3  col4  col5
0     5     1    11     9    11
1     2     3    14     7     3
2     6     5    54     8    54
3    11     2    67    44    44
4    23     8     2    23    23
5    97     5     9     8     1
6     9     7    45    71    71
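
As the number of branches grows, nested calls become hard to read. One possible alternative, not part of the original answer, is numpy.select, which takes a list of conditions and a matching list of choices and uses the first condition that is True for each row. The sketch below assumes the same sample data as the output above (with 97 in row 5 of col1); the names conditions and choices are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame matching the printed output above (col1 in row 5 is 97)
df = pd.DataFrame({
    "col1": [5, 2, 6, 11, 23, 97, 9],
    "col2": [1, 3, 5, 2, 8, 5, 7],
    "col3": [11, 14, 54, 67, 2, 9, 45],
    "col4": [9, 7, 8, 44, 23, 8, 71],
})

# Conditions are checked in order; the first True one wins,
# so the lower bounds (>= 3, >= 7) do not need to be repeated.
conditions = [
    df["col1"] < 3,
    df["col1"] < 7,
    df["col1"] < 50,
]
choices = [df["col2"], df["col3"], df["col4"]]

# default=1 covers rows where no condition holds (col1 >= 50)
df["col5"] = np.select(conditions, choices, default=1)
print(df)
```

Because the order of the conditions carries the logic here, reordering the list changes the result; spelling out both bounds in each condition avoids that at the cost of some verbosity.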

Timings for len(df) = 7000:

In [441]: %timeit df['col51'] = np.where(df['col1'] <3, df['col2'], np.where((df['col1'] <7) & (df['col1'] >=3 ), df['col3'], df['col4']))
The slowest run took 5.31 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.25 ms per loop

In [442]: %timeit df["col52"] = df.apply(lambda x: col52(x), axis=1)
1 loop, best of 3: 552 ms per loop

In [443]: %timeit df["col53"] = [col53(c1,c2,c3,c4) for c1,c2,c3,c4 in zip(df.col1,df.col2,df.col3,df.col4)]
100 loops, best of 3: 9.87 ms per loop

Code for timings:

#change 1000 to 10000 for 70k
df = pd.concat([df]*1000).reset_index(drop=True)

def col52(x):
    if x["col1"] < 3:
        return x["col2"]
    elif x["col1"] >=3 and x["col1"] < 7:
        return x["col3"]
    elif x["col1"] >= 7 and x["col1"] < 50:
        return x["col4"] 
def col53(c1,c2,c3,c4):
    if c1 < 3:
        return c2
    elif c1 >=3 and c1 < 7:
        return c3
    elif c1>= 7 and c1< 50:
        return c4    

df['col51'] = np.where(df['col1'] <3, df['col2'], np.where((df['col1'] <7) & (df['col1'] >=3 ), df['col3'], df['col4']))       
df["col52"] = df.apply(lambda x: col52(x), axis=1)
df["col53"] = [col53(c1,c2,c3,c4) for c1,c2,c3,c4 in zip(df.col1,df.col2,df.col3,df.col4)]
print (df)

One approach is to use the pd.DataFrame.apply function:

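    def col5(x):
        # x is one row of df; pick the source column based on col1
        if x["col1"] < 3:
            return x["col2"]
        elif x["col1"] >= 3 and x["col1"] < 7:
            return x["col3"]
        elif x["col1"] >= 7 and x["col1"] < 50:
            return x["col4"]
        # col1 >= 50 falls through and returns None

    df["col5"] = df.apply(lambda x: col5(x), axis=1)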

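Also, in general, apply can be very slow, especially when the function contains if-else blocks, because for every row your processor has to decide which statement in the if-else block to execute ("pipelining" and "branch prediction"). You should be fine here, though.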
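
Whether apply is fast enough for a given frame size can be checked with a quick measurement. The following is a minimal sketch using the standard timeit module, assuming the seven-row frame from the question repeated 1000 times; the helper names pick, vectorized and row_wise are hypothetical, and col4 is used as the fallback in both versions so that their results can be compared. Absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Seven-row frame from the question, repeated to 7000 rows
base = pd.DataFrame({
    "col1": [5, 2, 6, 11, 23, 1, 9],
    "col2": [1, 3, 5, 2, 8, 5, 7],
    "col3": [11, 14, 54, 67, 2, 9, 45],
    "col4": [9, 7, 8, 44, 23, 8, 71],
})
df = pd.concat([base] * 1000).reset_index(drop=True)


def pick(row):
    # Row-wise logic; col4 is the fallback for everything >= 7
    if row["col1"] < 3:
        return row["col2"]
    elif row["col1"] < 7:
        return row["col3"]
    return row["col4"]


def vectorized(frame):
    # Whole-column logic with nested np.where, same fallback as pick
    return np.where(frame["col1"] < 3, frame["col2"],
                    np.where(frame["col1"] < 7, frame["col3"], frame["col4"]))


def row_wise(frame):
    # Calls pick once per row
    return frame.apply(pick, axis=1)


runs = 3
t_vec = timeit.timeit(lambda: vectorized(df), number=runs)
t_apply = timeit.timeit(lambda: row_wise(df), number=runs)
print(f"np.where: {t_vec / runs * 1e3:.2f} ms per call")
print(f"apply:    {t_apply / runs * 1e3:.2f} ms per call")

# Both approaches should produce identical values
same = np.array_equal(vectorized(df), row_wise(df).to_numpy())
print("identical results:", same)
```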

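Are the number of columns and the logic fixed?

Yes, the columns and the logic are fixed.

What is the logic for col1 > 50?

Awesome! Is col52 so much slower than col53 because of the DataFrame? Is column access much faster than row access? Thanks!
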
Timings for len(df) = 70k:

In [446]: %timeit df['col51'] = np.where(df['col1'] <3, df['col2'], np.where((df['col1'] <7) & (df['col1'] >=3 ), df['col3'], df['col4']))
100 loops, best of 3: 2.5 ms per loop

In [447]: %timeit df["col52"] = df.apply(lambda x: col52(x), axis=1)
1 loop, best of 3: 5.36 s per loop

In [448]: %timeit df["col53"] = [col53(c1,c2,c3,c4) for c1,c2,c3,c4 in zip(df.col1,df.col2,df.col3,df.col4)]
10 loops, best of 3: 96.3 ms per loop