Extracting last names from a list of full names using Python/Pandas and possibly regular expressions

python string list pandas

I am wrangling a dataset and have ended up with a list of names in the following form:

s = ['DR. James Coffins',
     'Zacharias Pallefas',
     'Matthew Ebnel',
     'Ranzzith Redly',
     'GEORGE GEORGIADAKIS',
     'HARISH KUMARAN K',
     'Christiaan Kraanlen, CFA',
     'Mary K. Lein, CFA, COL',
     'Alexandre Cegra,  CFA,  CAIA',
     'Anna Bely']

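I have to extract the last names and put them in a separate list (or a column in a DataFrame). However, I am baffled by the many forms a full name can take, and I am new to Python.

A possible algorithm would look like this: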

Loop through the elements of the list. For each element, split it into
subelements on spaces. Then:

a) If there are four or fewer subelements, start from the beginning and
examine the first four subelements.
a1) If the first subelement is longer than 2 letters: if the second
subelement is longer than one letter, return the second subelement;
otherwise, return the third subelement.
a2) If the first subelement is 2 letters long, drop it and repeat
step a1.
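The steps above could be sketched in Python roughly as follows. This is only one reading of the pseudocode, not a definitive implementation. The helper names count_letters and last_name_by_length are made up for illustration. Two assumptions go beyond the original text: names with more than four subelements are handled by looking at their first four subelements only, and a leading token of fewer than two letters is dropped in the same way as a two-letter one.

```python
def count_letters(token):
    """Count alphabetic characters only, so 'DR.' counts as 2 and 'K.' as 1."""
    return sum(ch.isalpha() for ch in token)


def last_name_by_length(full_name):
    """Guess the last name using the length-based rules a1/a2."""
    # Assumption: only the first four subelements are ever examined.
    tokens = full_name.split()[:4]

    # Step a2: drop a short leading token such as 'DR.'.
    # Assumption: tokens with fewer than two letters are dropped as well.
    if tokens and count_letters(tokens[0]) <= 2:
        tokens = tokens[1:]

    if len(tokens) < 2:
        return None  # nothing left that could be a last name

    # Step a1: take the second token unless it is a one-letter initial.
    if count_letters(tokens[1]) > 1:
        candidate = tokens[1]
    elif len(tokens) > 2:
        candidate = tokens[2]
    else:
        return None

    # Remove trailing punctuation and normalise the casing.
    return candidate.strip(",.").capitalize()


names = ['DR. James Coffins', 'HARISH KUMARAN K', 'Mary K. Lein, CFA, COL']
last_names = [last_name_by_length(n) for n in names]
```

Note that capitalize() lowercases everything after the first letter, so a name such as 'McDonald' would come back as 'Mcdonald'.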

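How about always grabbing the second element of each name, after skipping any word that contains a '.' or that appears in an exclusion list such as ['dr', 'mr', 'mrs', 'miss', 'prof']?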

>>> exclude_tags = ['dr', 'mr', 'mrs', 'miss', 'prof']
>>> [[y for y in x.split() if '.' not in y and y.lower() not in exclude_tags][1].rstrip(',').capitalize() for x in s]
['Coffins', 'Pallefas', 'Ebnel', 'Redly', 'Georgiadakis', 'Kumaran', 'Kraanlen', 'Lein', 'Cegra', 'Bely']
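For readability, the one-liner above could also be unpacked into a small function. The sketch below follows the same logic; the names EXCLUDE_TAGS and extract_last_name are hypothetical. The only behavioural change is an added guard: it returns None instead of raising IndexError when fewer than two usable words remain.

```python
EXCLUDE_TAGS = {'dr', 'mr', 'mrs', 'miss', 'prof'}


def extract_last_name(full_name, exclude_tags=EXCLUDE_TAGS):
    """Return the second word that is neither a title nor an initial."""
    # Skip words containing a '.' (e.g. 'DR.', 'K.') and known titles.
    words = [
        w for w in full_name.split()
        if '.' not in w and w.lower() not in exclude_tags
    ]
    if len(words) < 2:
        return None  # added guard; the one-liner would raise IndexError here
    # Strip the trailing comma left over from suffixes such as ', CFA'.
    return words[1].rstrip(',').capitalize()


sample = ['DR. James Coffins', 'Mary K. Lein, CFA, COL', 'Madonna']
result = [extract_last_name(name) for name in sample]
```

If the names live in a DataFrame, the same function could presumably be applied to a column with something like df['full_name'].apply(extract_last_name), where 'full_name' is a hypothetical column name.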

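For anyone else who runs into this problem, keep in mind that, in general, it is impossible to reliably extract a person's last name from their full name.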

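Sunitha's solution will fail for anyone whose last name consists of multiple tokens (Van Gogh), who has more than one last name (Gonzalez Ramirez), whose first name consists of multiple tokens (Mary Jane Watson), who chose to enter a middle name in whatever system produced this list, or who comes from an Asian culture where the order of given name and family name is sometimes reversed, and so on.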
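To make these failure cases concrete, here is a small illustration of what a "take the second word" heuristic returns for such names. The helper second_token is a simplified stand-in for the answer's one-liner, not the original code. The given names "Vincent" and "Maria" and the expected values are added purely for illustration.

```python
def second_token(full_name):
    """Simplified second-word heuristic: return the second word, capitalized."""
    tokens = full_name.split()
    if len(tokens) < 2:
        return None
    return tokens[1].rstrip(',').capitalize()


# Illustrative names mapped to the last name a human reader would expect.
expected = {
    'Vincent van Gogh': 'van Gogh',                # multi-token last name
    'Maria Gonzalez Ramirez': 'Gonzalez Ramirez',  # two last names
    'Mary Jane Watson': 'Watson',                  # multi-token first name
}

# Collect every name for which the heuristic disagrees with the expectation.
mismatches = {
    name: second_token(name)
    for name, wanted in expected.items()
    if second_token(name) != wanted
}
```

All three names end up in mismatches: the heuristic returns only part of the surname in the first two cases and part of the given name in the third.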

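What have you tried? This looks like fairly straightforward Python string manipulation.

I was wondering whether there is a smarter way, and I think Sunitha has provided one.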