Python: How to conditionally split cell values and add them to a column using pandas
Say I have testing.csv:
First Name Last Name Profile URL
Ashleigh Phelps https://www.linkedin.com/in/ashleighephelps
Jonathan https://www.linkedin.com/in/jonathantsegal
Camilla Innes https://www.linkedin.com/in/camilla-innes-61213628
Rachel https://www.linkedin.com/in/rachel-hudesman-335b8120
Michael https://www.linkedin.com/in/mikeitalia
Antonio https://www.linkedin.com/in/antoniomolinelli
Lauren Zsigray https://www.linkedin.com/in/lauren-zsigray-13b5aa25
The code I am using only splits out the hyphenated part of the URL. How can I get both the first name and the last name?
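import pandas as pd

df = pd.read_csv("testing.csv", sep=',', encoding="utf-8")
# Keep only the rows whose last name is missing
df = df[df['Last Name'].isnull()]
# Take the URL column out and split it on '/'
p = df.pop('Profile URL')
tmp_df = p.str.split('/')
# The last path segment of the URL holds the name
df['Last Name'] = tmp_df.str[-1]
# Split that segment on '-' and keep the middle pieces
tmp1_df = df.pop('Last Name').str.split('-')
df['Last Name'] = tmp1_df.str[1:-1].str.join(sep='-')
df = pd.concat([df, p], axis=1)
print(df)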
This produces the following output:
First Name Last Name Profile URL
Ashleigh Phelps https://www.linkedin.com/in/ashleighephelps
Jonathan https://www.linkedin.com/in/jonathantsegal
Camilla Innes https://www.linkedin.com/in/camilla-innes-61213628
Rachel hudesman https://www.linkedin.com/in/rachel-hudesman-335b8120
Michael https://www.linkedin.com/in/mikeitalia
Antonio https://www.linkedin.com/in/antoniomolinelli
Lauren Zsigray https://www.linkedin.com/in/lauren-zsigray-13b5aa25
Expected output:
First Name Last Name Profile URL
Ashleigh Phelps https://www.linkedin.com/in/ashleighephelps
Jonathan tsegal https://www.linkedin.com/in/jonathantsegal
Camilla Innes https://www.linkedin.com/in/camilla-innes-13628
Rachel hudesman https://www.linkedin.com/in/rachel-hudesman-33
Michael https://www.linkedin.com/in/mikeitalia
Antonio molinelli https://www.linkedin.com/in/antoniomolinelli
Lauren Zsigray https://www.linkedin.com/in/lauren-zsigray-13b5a
What should I use to get output in this format?

Try the code below:
import pandas as pd

df = pd.read_csv("testing.csv", sep=',', encoding="utf-8")
# Empty cells are read as NaN; turn them into empty strings
df.fillna('', inplace=True)

def clear_data(x):
    fname = x['First Name']
    lname = x['Last Name'].strip()
    url = x['Profile URL']
    if not lname:
        # Last name is missing: try to recover it from the URL
        fname = fname.split(' ')[0]
        url_name = url.split('/')[-1].split('-')
        if len(url_name) > 1:
            # Hyphenated URL: the piece before the trailing id is the last name
            lname = url_name[-2].title()
        else:
            # No hyphen: strip the first name from the front of the URL name
            index_of_fname = url_name[0].lower().find(fname.lower())
            if index_of_fname != -1:
                index_of_fname += len(fname)
                lname = url_name[0][index_of_fname:].title()
        x['First Name'] = fname
        x['Last Name'] = lname
    else:
        lname = lname.split('-')[0].strip()
        x['Last Name'] = lname
    return x

# apply returns a new DataFrame, so the result has to be assigned back
df = df.apply(clear_data, axis=1)
print(df)
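The URL-parsing rule used above can also be isolated as a small pure function, which makes it easier to check against individual rows. This is only a sketch: the helper name `last_name_from_url` is hypothetical, and it assumes that in a hyphenated URL the second piece is the last name and that an unhyphenated URL starts with the lowercased first name.

```python
def last_name_from_url(first_name, url):
    """Guess a last name from the last URL segment; return '' if no guess is possible."""
    slug = url.rstrip('/').split('/')[-1]
    parts = slug.split('-')
    if len(parts) > 1:
        # e.g. "rachel-hudesman-335b8120" -> "hudesman"
        return parts[1]
    first = first_name.lower()
    if slug.lower().startswith(first):
        # e.g. "jonathantsegal" minus "jonathan" -> "tsegal"
        return slug[len(first):]
    # The first name does not appear at the start (e.g. "Michael" vs "mikeitalia")
    return ''


for first, url in [
    ('Jonathan', 'https://www.linkedin.com/in/jonathantsegal'),
    ('Rachel', 'https://www.linkedin.com/in/rachel-hudesman-335b8120'),
    ('Michael', 'https://www.linkedin.com/in/mikeitalia'),
]:
    print(first, repr(last_name_from_url(first, url)))
```

Rows such as Michael / mikeitalia stay empty because the URL gives no reliable hint about where the first name ends.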
OK, these long one-liners work:
df.loc[(df['Last Name']=='')&(df['First Name'].apply(lambda x: len(x.split()))>1), 'Last Name'] = df.loc[df['First Name'].apply(lambda x: len(x.split()))>1, 'First Name'].apply(lambda x: x.split()[1])
df.loc[(df['First Name'].apply(lambda x: len(x.split()))>1), 'First Name'] = df.loc[df['First Name'].apply(lambda x: len(x.split()))>1, 'First Name'].apply(lambda x: x.split()[0])
df.loc[(df['Last Name']=='')&(df['Profile URL'].apply(lambda x: len(x.split('-')))>1), 'Last Name'] = df.loc[df['Profile URL'].apply(lambda x: len(x.split('-')))>1, 'Profile URL'].apply(lambda x: x.split('-')[1])
df.loc[(df['Last Name']=='')&(df.apply(lambda x: x['First Name'].lower() in x['Profile URL'], axis=1)), 'Last Name'] = df.loc[(df['Last Name']=='')&(df.apply(lambda x: x['First Name'].lower() in x['Profile URL'], axis=1))].apply(lambda x: x['Profile URL'].split('/')[-1].replace(x['First Name'].lower(), ''), axis=1)
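The same masking idea may be easier to follow with the intermediate pieces named. The sketch below is not the answerer's exact code: it uses a hypothetical inline CSV in place of testing.csv, assumes the file is comma-separated with empty last-name cells, and assumes the URLs are lowercase.

```python
import io
import pandas as pd

# Hypothetical stand-in for testing.csv
CSV_TEXT = """First Name,Last Name,Profile URL
Ashleigh,Phelps,https://www.linkedin.com/in/ashleighephelps
Jonathan,,https://www.linkedin.com/in/jonathantsegal
Rachel,,https://www.linkedin.com/in/rachel-hudesman-335b8120
Michael,,https://www.linkedin.com/in/mikeitalia
"""

df = pd.read_csv(io.StringIO(CSV_TEXT)).fillna('')

# Last path segment of each URL, e.g. "rachel-hudesman-335b8120"
slug = df['Profile URL'].str.rstrip('/').str.split('/').str[-1]
parts = slug.str.split('-')

# Step 1: hyphenated URLs, take the second hyphen-separated piece
hyphenated = (df['Last Name'] == '') & (parts.str.len() > 1)
df.loc[hyphenated, 'Last Name'] = parts[hyphenated].str[1]

# Step 2: URLs that start with the lowercased first name, keep the remainder
first_lower = df['First Name'].str.lower()
remainder = pd.Series(
    [s[len(f):] if s.startswith(f) else '' for s, f in zip(slug, first_lower)],
    index=df.index,
)
glued = (df['Last Name'] == '') & (remainder != '')
df.loc[glued, 'Last Name'] = remainder[glued]

print(df[['First Name', 'Last Name']])
```

Each assignment uses a Series that shares the DataFrame's index, so pandas aligns the values row by row.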
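Beyond splitting a cell value, you need to know where the first name ends and the last name begins... Unless you have a list of possible first names (which could be huge even if you only consider the most common names, but it is possible), I don't think this is just a cell-splitting problem. Let's see if someone finds a better solution.
This only works with the data in this code, not with the data in the csv file.
Can you give me your mail id so I can send you the file I pasted as testing.csv?
So... give me a link :) Let's take a look.
idk, this gives an error like: raise ValueError('Must have equal len keys and value' ValueError: Must have equal len keys and value when setting with an iterable
I assumed your data was representative; this code works with the data you provided. The only thing you may need to change is (df['Last Name']=='') to (df['Last Name'].isnull()), because I created the df from the string data you provided.
I can send the data to a non-confidential email address.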
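On the `== ''` versus `.isnull()` point raised in the comments: `pd.read_csv` reads empty cells as NaN by default, so a mask built with `== ''` selects nothing unless the frame has been passed through `fillna('')` first. A minimal sketch, using a hypothetical two-row CSV rather than the real file:

```python
import io
import pandas as pd

# Hypothetical two-row sample; the second row has an empty Last Name cell
CSV_TEXT = """First Name,Last Name
Ashleigh,Phelps
Jonathan,
"""

raw = pd.read_csv(io.StringIO(CSV_TEXT))

# On the raw frame the empty cell is NaN, so only isnull() finds it
eq_empty_raw = (raw['Last Name'] == '').tolist()
is_null_raw = raw['Last Name'].isnull().tolist()

# After fillna(''), the comparison with '' finds the same row
filled = raw.fillna('')
eq_empty_filled = (filled['Last Name'] == '').tolist()

print(eq_empty_raw, is_null_raw, eq_empty_filled)
```

Either approach works as long as it is used consistently: test with `.isnull()` on the raw frame, or call `fillna('')` once and test with `== ''`.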