Pandas: split a column and extract country, city, and organization names


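I have a data frame with an address column, shown below. I want to split this column so that country, city, and institution each end up in their own column. The challenging part is that every cell has a different structure. What all the cells have in common is that they end with city, country, but in some cases, such as row index 3, a cell holds multiple entries.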

    id      address
------------------------------------------------------------------------------------------------
0   223     Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
1   223     Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
2   223     Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
3   223     Department of Surgery, Närebro University Hospital, Närebro; Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden'}, {'id': '9900', 'name': 'Närebro universitet, Institutionen för läkarutbildning
4   223     Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
5   223     Division of Gastrointestinal Surgery, Nottingham Digestive Diseases Centre, National Institute for Health Research, Biomedical Research Unit, Nottingham University Hospitals, Queen's Medical Centre, Nottingham, UK
6   223     Hospital of Lausanne (CHUV), Lausanne, Switzerland
7   223     Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
8   223     Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
9   223     Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
Can anyone help?


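Note that the data frame above is a subset of my data frame, which is why the id column has the same value in every row. The original data frame has about 10k rows, so I can't share it here.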

This may be too simplistic for a 10k-row dataset, but hopefully it points you in the right direction.

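Note that row index 3 is malformed: it contains curly braces and similar debris, which looks like a parsing problem from when the data was created or dumped. The example below ignores this; in practice you would want to clean the input or fix the problem upstream.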
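If fixing the problem upstream is not possible, one way to clean such cells is to cut the string at the first leftover brace. This is only a rough sketch: the helper name `strip_debris` and the regular expression are assumptions based on the single malformed row shown in the question, and other rows in the full data may need different rules.

```python
import re

# Hypothetical pattern: an optional quote, optional whitespace, then a curly
# brace, followed by everything up to the end of the string.
_DEBRIS = re.compile(r"['\"]?\s*[{}].*$")

def strip_debris(address: str) -> str:
    """Return the address with everything from the first brace onward removed."""
    return _DEBRIS.sub("", address).strip()

sample = ("Karolinska Institutet, Stockholm, Sweden'}, "
          "{'id': '9900', 'name': 'Närebro universitet")
cleaned = strip_debris(sample)
print(cleaned)
```

An apostrophe inside a name, as in "Queen's Medical Centre", is left alone because it is not followed by a brace.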

First, I create a toy dataset from your data:

import pandas as pd
from io import StringIO
raw_data = StringIO(
"""
!Id!address
0!223!Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
1!223!Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
2!223!Department of Surgery, University Hospital Maastricht, Maastricht, The Netherlands; NUTRIM School for Nutrition, Toxicology and Metabolism, Maastricht University, Maastricht, The Netherlands
3!223!Department of Surgery, Närebro University Hospital, Närebro; Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden'}, {'id': '9900', 'name': 'Närebro universitet, Institutionen för läkarutbildning
4!223!Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
5!223!Division of Gastrointestinal Surgery, Nottingham Digestive Diseases Centre, National Institute for Health Research, Biomedical Research Unit, Nottingham University Hospitals, Queen's Medical Centre, Nottingham, UK
6!223!Hospital of Lausanne (CHUV), Lausanne, Switzerland
7!223!Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
8!223!Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK
9!223!Department of GI and HPB Surgery, University Hospital Northern Norway, Breivika, Tromsø, Norway; Institute of Clinical Medicine, University of Tromsø, Tromsø, Norway
""")
data = pd.read_csv(raw_data, index_col=0, delimiter='!')
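Next, for rows that contain multiple addresses, I split them and put each address on its own row of the data frame. I assume they are always separated by ';', as in your example.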

data['address'] = data['address'].str.split(';')
data = data.explode('address')
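Two side effects of this step are easy to miss: the pieces after a "; " keep a leading space, and the exploded rows share the index of the row they came from. A minimal sketch of handling both, using two rows from the question (the names `df` and `exploded` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [223, 223],
    "address": [
        "Department of GI and HPB Surgery, University Hospital Northern Norway, "
        "Breivika, Tromsø, Norway; Institute of Clinical Medicine, "
        "University of Tromsø, Tromsø, Norway",
        "Hospital of Lausanne (CHUV), Lausanne, Switzerland",
    ],
})

# Split on ';' and give each affiliation its own row.
exploded = df.assign(address=df["address"].str.split(";")).explode("address")
# Remove the whitespace left behind by "; " and make the index unique again.
exploded["address"] = exploded["address"].str.strip()
exploded = exploded.reset_index(drop=True)
print(exploded)
```

Whether to reset the index is a design choice: keeping the duplicated index makes it easy to trace each address back to its original row.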
Next, I tokenize each address by splitting it on ','. After this step, the address_tokens column holds the list of tokens.

data['address_tokens'] = data['address'].str.split(',')
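Now, for each row, we combine the N tokens into a three-element list: the first N-2 tokens joined back together with commas, then token N-2, then token N-1 (counting from 0). We identify these as institution, city, and country. The tokens are stripped because splitting on ',' leaves leading spaces.

data['address_3'] = data['address_tokens'].apply(lambda tks: [','.join(tks[:-2]).strip(), tks[-2].strip(), tks[-1].strip()])
data[['institution', 'city', 'country']] = data['address_3'].apply(pd.Series)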
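As an alternative to tokenizing and re-joining, the same three-way split can be sketched with `Series.str.rsplit`, which splits from the right a limited number of times. This is a swapped-in technique, not the method used above, and it assumes at least one address in the column contains two or more commas (otherwise fewer than three columns come back and the renaming fails).

```python
import pandas as pd

addresses = pd.Series([
    "Hospital of Lausanne (CHUV), Lausanne, Switzerland",
    "Clinical Surgery, University of Edinburgh, Royal Infirmary of Edinburgh, Edinburgh, UK",
])

# Split from the right at most twice: everything before the last two commas
# stays together as the institution.
parts = addresses.str.rsplit(",", n=2, expand=True)
parts.columns = ["institution", "city", "country"]
# Remove the spaces that followed each comma.
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Addresses that do not end in "city, country", such as the first half of row index 3, are still split positionally and will need a separate check.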

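I kept all the intermediate steps in the data frame so that you can inspect the results. The three columns ['institution', 'city', 'country'] contain what you asked for, apart from the problems caused by the {, } etc. in original row index 3.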

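You could build one list of all countries and another of cities, then use regular expressions to extract the matching strings. Whatever remains is the institution.

Your logic seems reasonable. Can you provide your code snippet?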
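The lookup-list idea from the comment can be sketched as follows. The `COUNTRIES` list here is an assumption containing only the countries visible in the sample; a real run would need a complete list, and the same approach would apply to cities.

```python
import re
import pandas as pd

# Assumed lookup list, limited to the countries that appear in the sample.
COUNTRIES = ["Norway", "The Netherlands", "Sweden", "Switzerland", "UK"]
# Match a known country only when it is the last comma-separated field.
pattern = r",\s*(" + "|".join(re.escape(c) for c in COUNTRIES) + r")\s*$"

addresses = pd.Series([
    "Department of Surgery, Närebro University Hospital, Närebro",
    "Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden",
])

# Rows without a recognized trailing country come back as NaN.
country = addresses.str.extract(pattern, expand=False)
print(country)
```

Unlike the positional split, this flags addresses with no recognizable country instead of silently mislabeling the last token.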