Pandas str.contains and str.find give different results
It seems to me that both of these should give the same answer:
import pandas as pd
train = pd.read_csv('https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv')
train.name.str.contains('Mr.').sum()
(train.name.str.find('Mr.')>0).sum()
But the output is:
647
517
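What is the reason behind the different results?
The difference is that str.contains also matches Mrs., because . is a special regex character (it matches any single character), and str.contains treats its pattern as a regular expression by default.
I think you need to escape it or pass the parameter regex=False: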
print(train.name.str.contains(r'Mr\.').sum())
517
print(train.name.str.contains('Mr.', regex=False).sum())
517
print((train.name.str.find('Mr.')>0).sum())
517
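The same effect can be reproduced without downloading the Titanic file. The following is a minimal sketch, assuming a small hand-made Series (the `names` values are hypothetical sample data, not rows from train.csv). It compares the default regex match with three ways of requiring a literal dot, plus str.find:

```python
import re
import pandas as pd

# Hypothetical sample names, not taken from the Titanic file.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# Default regex=True: '.' matches any character, so 'Mrs' matches too.
regex_hits = names.str.contains('Mr.').sum()

# Three ways to require a literal dot.
escaped_hits = names.str.contains(r'Mr\.').sum()
auto_escaped_hits = names.str.contains(re.escape('Mr.')).sum()
literal_hits = names.str.contains('Mr.', regex=False).sum()

# str.find returns -1 when the substring is absent, so >= 0 means "found".
find_hits = (names.str.find('Mr.') >= 0).sum()

print(regex_hits, escaped_hits, auto_escaped_hits, literal_hits, find_hits)
```

Here only the unescaped regex version counts the "Mrs." row. re.escape is handy when the search text comes from a variable and may contain other regex metacharacters.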
Checking the difference:
a = train.loc[train.name.str.contains('Mr.'), 'name']
b = train.loc[(train.name.str.find('Mr.')>0), 'name']
c = pd.concat([a, b], axis=1, keys=('contains','find'))
c = c[c.isnull().any(axis=1)]
print(c)
contains find
1 Cumings, Mrs. John Bradley (Florence Briggs Th... NaN
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) NaN
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) NaN
9 Nasser, Mrs. Nicholas (Adele Achem) NaN
15 Hewlett, Mrs. (Mary D Kingcome) NaN
18 Vander Planke, Mrs. Julius (Emelia Maria Vande... NaN
19 Masselmani, Mrs. Fatima NaN
25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... NaN
31 Spencer, Mrs. William Augustus (Marie Eugenie) NaN
40 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) NaN
41 Turpin, Mrs. William John Robert (Dorothy Ann ... NaN
49 Arnold-Franchi, Mrs. Josef (Josefine Franchi) NaN
52 Harper, Mrs. Henry Sleeper (Myna Haxtun) NaN
53 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin... NaN
66 Nye, Mrs. (Elizabeth Ramell) NaN
85 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... NaN
...
...
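An alternative way to list the rows where the two methods disagree is to compare the boolean masks directly. This is only a sketch under assumptions: `names` is hypothetical sample data, and the find test uses >= 0 rather than > 0 so that a match at position 0 would also count (for names of the form "Surname, Title ..." this makes no difference).

```python
import pandas as pd

# Hypothetical sample names, not taken from the Titanic file.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Hewlett, Mrs. (Mary D Kingcome)",
])

regex_mask = names.str.contains('Mr.')       # '.' treated as a regex wildcard
literal_mask = names.str.find('Mr.') >= 0    # literal substring search

# XOR keeps the rows where exactly one of the two masks is True.
disagree = names[regex_mask ^ literal_mask]
print(disagree.tolist())
```

On this sample, the rows that disagree are exactly the "Mrs." ones, which matches the pattern in the output above.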
Thanks a lot. I was about to ask exactly the question you addressed in your edit.