Python: checking whether both columns in a row contain the same string
I have a DataFrame that looks like this:
id k1 k2 same
1 re_setup oo_setup true
2 oo_setup oo_setup true
3 alerting bounce false
4 bounce re_oversetup false
5 re_oversetup alerting false
6 alerting_s re_setup false
7 re_oversetup oo_setup true
8 alerting bounce false
So I need to classify each row by whether or not it contains the string "setup".
The expected output would be:
id k1 k2 same
1 re_setup oo_setup true
2 oo_setup oo_setup true
3 alerting bounce false
4 bounce re_setup false
5 re_setup alerting false
6 alerting_s re_setup false
7 re_setup oo_setup true
8 alerting bounce false
I tried the following, but I get an error when selecting multiple columns:
data['same'] = data[data['k1', 'k2'].str.contains('setup')==True]
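I think you need apply, because str.contains works only on a Series (one column).
Then add DataFrame.all to check whether all values in each row are True: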
data['same'] = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).all(axis=1)
print(data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup False
4 5 re_setup alerting False
5 6 alerting_s re_setup False
6 7 re_setup oo_setup True
7 8 alerting bounce False
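For anyone who wants to run the snippet above, here is a minimal, self-contained sketch. The frame is rebuilt by hand from the output table, so the construction code is an assumption rather than the asker's actual data source. The na=False argument is also an addition, so that missing values (if there are any) count as no match.

```python
import pandas as pd

# Sample frame rebuilt by hand from the table above (assumed construction).
data = pd.DataFrame({
    "id": list(range(1, 9)),
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_setup", "alerting_s", "re_setup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_setup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})

# str.contains is a Series method, so it is applied column by column;
# na=False treats missing values as "no match".
mask = data[["k1", "k2"]].apply(lambda col: col.str.contains("setup", na=False))

# True only where every selected column in the row matched.
data["same"] = mask.all(axis=1)
print(data)
```

Selecting with double brackets, data[['k1', 'k2']], returns a DataFrame with both columns, whereas data['k1', 'k2'] looks for a single column keyed by a tuple, which is why the original attempt fails.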
Or use any to check for at least one True per row:
data['same'] = data[['k1', 'k2']].applymap(lambda x: 'setup' in x).any(axis=1)
print(data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup True
4 5 re_setup alerting True
5 6 alerting_s re_setup True
6 7 re_setup oo_setup True
7 8 alerting bounce False
Another solution, with an element-wise check:
data['same'] = data[['k1', 'k2']].applymap(lambda x: 'setup' in x).all(axis=1)
print(data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup False
4 5 re_setup alerting False
5 6 alerting_s re_setup False
6 7 re_setup oo_setup True
7 8 alerting bounce False
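pandas 2.1 and later deprecate DataFrame.applymap in favour of DataFrame.map, so a version-tolerant variant of the element-wise check may be useful. The sketch below is one possible way to bridge both. The helper name contains_setup is hypothetical, the three-row frame is made up for illustration, and the isinstance guard for non-string cells is an addition that the answer above does not include.

```python
import pandas as pd

# Small illustrative frame (made-up subset of the sample data).
data = pd.DataFrame({
    "k1": ["re_setup", "bounce", "alerting"],
    "k2": ["oo_setup", "re_setup", "bounce"],
})

def contains_setup(value):
    # Hypothetical helper: non-string cells (e.g. NaN) count as no match.
    return isinstance(value, str) and "setup" in value

cols = data[["k1", "k2"]]
# DataFrame.map exists from pandas 2.1; fall back to applymap on older versions.
elementwise = cols.map if hasattr(cols, "map") else cols.applymap

data["same"] = elementwise(contains_setup).all(axis=1)
print(data)
```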
If there are only two columns, simply chain the conditions with & (like all) or | (like any):
data['same'] = data['k1'].str.contains('setup') & data['k2'].str.contains('setup')
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_setup False
4 5 re_setup alerting False
5 6 alerting_s re_setup False
6 7 re_setup oo_setup True
7 8 alerting bounce False
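The snippet above shows only the & form. As a minimal sketch, both operators can be compared side by side. The column names both and either are hypothetical, the frame is rebuilt by hand from the tables above, and na=False is an added safeguard for missing values.

```python
import pandas as pd

# Sample frame rebuilt by hand from the tables above (assumed construction).
data = pd.DataFrame({
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_setup", "alerting_s", "re_setup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_setup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})

# One boolean Series per column.
in_k1 = data["k1"].str.contains("setup", na=False)
in_k2 = data["k2"].str.contains("setup", na=False)

data["both"] = in_k1 & in_k2    # same result as .all(axis=1)
data["either"] = in_k1 | in_k2  # same result as .any(axis=1)
print(data)
```

Note that & and | are the element-wise operators. The Python keywords and / or do not work on a Series and raise a ValueError.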
Here is another, generic approach using a reduce operation, with no need for apply:
In [114]: np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
Out[114]: array([ True, True, False, True, True, True, True, False], dtype=bool)
Details
In [115]: df['same'] = np.logical_or.reduce(
[df[c].str.contains('setup') for c in ['k1', 'k2']])
In [116]: df
Out[116]:
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup True
4 5 re_oversetup alerting True
5 6 alerting_s re_setup True
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
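The same reduce idea extends to the "all columns must match" case by swapping in np.logical_and. The following is a sketch under assumptions: the frame is rebuilt by hand from the output above, the column names all_match and any_match are hypothetical, and na=False plus an explicit conversion to a boolean array are added so the reduction always sees plain NumPy booleans.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand from the output above (assumed construction).
df = pd.DataFrame({
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_oversetup", "alerting_s", "re_oversetup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_oversetup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})
cols = ["k1", "k2"]  # works unchanged for any number of string columns

# One boolean NumPy array per column.
masks = [df[c].str.contains("setup", na=False).to_numpy(dtype=bool) for c in cols]

df["all_match"] = np.logical_and.reduce(masks)  # every column contains "setup"
df["any_match"] = np.logical_or.reduce(masks)   # at least one column does
print(df)
```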
Will this work if there is no underscore "_" before setup, as in my question now that I have edited it? Thanks.
Yes, it only checks for the string setup.
Timings
Small
In [111]: df.shape
Out[111]: (8, 4)
In [108]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
1000 loops, best of 3: 421 µs per loop
In [109]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
1000 loops, best of 3: 2.01 ms per loop
Large
In [110]: df.shape
Out[110]: (40000, 4)
In [112]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
10 loops, best of 3: 59.5 ms per loop
In [113]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
10 loops, best of 3: 88.4 ms per loop
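The %timeit lines above need IPython. For a plain Python script, a rough equivalent might look like the sketch below. How the 40000-row frame was built is not stated, so repeating the 8 sample rows 5000 times is an assumption, as are the function names with_reduce and with_apply. The frame here holds only k1 and k2, so its shape differs from the one shown above, and absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    "k1": ["re_setup", "oo_setup", "alerting", "bounce",
           "re_oversetup", "alerting_s", "re_oversetup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_oversetup",
           "alerting", "re_setup", "oo_setup", "bounce"],
})
# Assumed construction of the large frame: repeat the 8 rows 5000 times.
large = pd.concat([small] * 5000, ignore_index=True)

def with_reduce(df):
    # NumPy reduction over one boolean array per column.
    return np.logical_or.reduce(
        [df[c].str.contains("setup").to_numpy(dtype=bool) for c in ["k1", "k2"]])

def with_apply(df):
    # Column-wise apply followed by a row-wise any.
    return df[["k1", "k2"]].apply(lambda x: x.str.contains("setup")).any(axis=1)

runs = 5
for name, frame in [("small", small), ("large", large)]:
    t_reduce = timeit.timeit(lambda: with_reduce(frame), number=runs) / runs
    t_apply = timeit.timeit(lambda: with_apply(frame), number=runs) / runs
    print(f"{name} ({len(frame)} rows): reduce {t_reduce:.6f} s, apply {t_apply:.6f} s")
```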