Python: create a DataFrame column based on strings in two other columns

python python-3.x pandas numpy dataframe

I have a DataFrame that looks like this:

boat_type   boat_type_2
Not Known   Not Known
Not Known   kayak
ship        Not Known
Not Known   Not Known
ship        Not Known

I want to create a third column, boat_type_final, which should look like this:

boat_type   boat_type_2  boat_type_final
Not Known   Not Known    cruise
Not Known   kayak        kayak
ship        Not Known    ship
Not Known   Not Known    cruise
ship        Not Known    ship

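So basically, if both boat_type and boat_type_2 contain 'Not Known', the value should be 'cruise'. However, if either of the first two columns holds a string other than 'Not Known', boat_type_final should be filled with that string, i.e. 'kayak' or 'ship'.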

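What is the most elegant way to do this? I have seen a few options, such as where, writing a function, and/or boolean logic, and I would like to know what a true Pythonista would do.

Here is my code so far: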

import pandas as pd
import numpy as np
data = [{'boat_type': 'Not Known', 'boat_type_2': 'Not Known'},
    {'boat_type': 'Not Known',  'boat_type_2': 'kayak'},
    {'boat_type': 'ship',  'boat_type_2': 'Not Known'},
    {'boat_type': 'Not Known',  'boat_type_2': 'Not Known'},
    {'boat_type': 'ship',  'boat_type_2': 'Not Known'}]
df = pd.DataFrame(data)
# Incomplete attempt with np.where:
# df['boat_type_final'] = np.where(df.boat_type.str.contains('Not'))...

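Replace 'Not Known' with missing values, forward-fill along each row, select the last column, and fill the remaining missing values with 'cruise'.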
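The code for this answer is not shown here. Based on the steps in the explanation and the .ffill(axis=1).iloc[:, -1] fragment quoted in the comments, the full chain could be sketched as follows. The final fillna('cruise') step and the source_cols name are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "boat_type": ["Not Known", "Not Known", "ship", "Not Known", "ship"],
    "boat_type_2": ["Not Known", "kayak", "Not Known", "Not Known", "Not Known"],
})

# Hypothetical helper name: the columns to combine.
source_cols = ["boat_type", "boat_type_2"]

df["boat_type_final"] = (
    df[source_cols]
    .replace("Not Known", np.nan)  # treat 'Not Known' as missing
    .ffill(axis=1)                 # carry known values left to right in each row
    .iloc[:, -1]                   # last column now holds the last known value
    .fillna("cruise")              # assumed default when both columns are unknown
)
print(df)
```

Restricting the chain to source_cols keeps it safe to rerun after boat_type_final has already been added to the frame.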

Explanation:

First, replace 'Not Known' with missing values:

print(df.replace('Not Known', np.nan))
  boat_type boat_type_2
0       NaN         NaN
1       NaN       kayak
2      ship         NaN
3       NaN         NaN
4      ship         NaN

Then replace the NaNs by forward-filling along each row:

print(df.replace('Not Known', np.nan).ffill(axis=1))
  boat_type boat_type_2
0       NaN         NaN
1       NaN       kayak
2      ship        ship
3       NaN         NaN
4      ship        ship

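Next, select the last column by position with .iloc[:, -1]. After the forward fill, it holds the last known value of each row.

Finally, because NaN values can remain (rows where both columns were 'Not Known'), fill them with fillna.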
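The printed results for these last two steps are missing here. A minimal sketch of both steps, assuming the same sample data as in the question and 'cruise' as the fill value (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "boat_type": ["Not Known", "Not Known", "ship", "Not Known", "ship"],
    "boat_type_2": ["Not Known", "kayak", "Not Known", "Not Known", "Not Known"],
})

filled = df.replace("Not Known", np.nan).ffill(axis=1)

# Step 3: pick the last column by position; rows 0 and 3 are still NaN.
last_col = filled.iloc[:, -1]
print(last_col)

# Step 4: replace the remaining NaN values with the default.
result = last_col.fillna("cruise")
print(result)
```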

Another solution is possible if only a few columns are used.
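The code for this alternative is not shown, so the following is only one possible reading of it: a sketch that uses numpy.select with one explicit condition per column. The choice of numpy.select and the names conditions and choices are assumptions, not the original answer's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "boat_type": ["Not Known", "Not Known", "ship", "Not Known", "ship"],
    "boat_type_2": ["Not Known", "kayak", "Not Known", "Not Known", "Not Known"],
})

# One condition per column; np.select uses the first condition that is True.
conditions = [
    df["boat_type"].ne("Not Known"),
    df["boat_type_2"].ne("Not Known"),
]
choices = [df["boat_type"], df["boat_type_2"]]

# Rows where no condition holds receive the default.
df["boat_type_final"] = np.select(conditions, choices, default="cruise")
print(df)
```

This stays readable for two or three columns, but the condition list grows with every extra column, which is where the forward-fill approach scales better.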

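Another solution is to define a function that holds the mapping:

def my_func(row):
    # Prefer boat_type, then boat_type_2; fall back to 'cruise'.
    if row['boat_type'] != 'Not Known':
        return row['boat_type']
    elif row['boat_type_2'] != 'Not Known':
        return row['boat_type_2']
    else:
        return 'cruise'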

[Note: you did not say what should happen when neither column is 'Not Known'.]
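That unspecified case is where the approaches can disagree. my_func returns boat_type when both columns are known, whereas forward-filling and taking the last column returns boat_type_2. A small sketch on made-up rows (the sample values and variable names are assumptions for illustration) shows the difference, along with a backward-fill variant that prefers the first column:

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including one where both columns are known.
sample = pd.DataFrame({
    "boat_type": ["ship", "Not Known", "Not Known"],
    "boat_type_2": ["kayak", "kayak", "Not Known"],
})

clean = sample.replace("Not Known", np.nan)

# Forward fill + last column: the rightmost known value wins.
last_wins = clean.ffill(axis=1).iloc[:, -1].fillna("cruise")

# Backward fill + first column: the leftmost known value wins, like my_func.
first_wins = clean.bfill(axis=1).iloc[:, 0].fillna("cruise")

print(pd.DataFrame({"last_wins": last_wins, "first_wins": first_wins}))
```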

Then simply apply the function:

df.loc[:,'boat_type_final'] = df.apply(my_func, axis=1)

print(df)

Output:

   boat_type boat_type_2 boat_type_final
0  Not Known   Not Known          cruise
1  Not Known       kayak           kayak
2       ship   Not Known            ship
3  Not Known   Not Known          cruise
4       ship   Not Known            ship

Can you explain how this works? Especially this part: .ffill(axis=1).iloc[:, -1]

@bzier - OK, give me a moment. @bzier - The answer has been edited.
