Replace a substring after a character in a Python DataFrame
I'm new to pandas and am having a lot of trouble with this. I've searched but haven't found a solution yet. I hope one of you can help me.
I have a pandas DataFrame with a column of emails that I'm trying to clean up. For example:
>>> email['EMAIL']
0 testing@...com
1 NaN
2 I.am.ME@GAMIL.COM
3 FIRST.LAST.NAME@MAIL.CMO
4 EMAIL+REMOVE@TESTING.COM
Name: EMAIL, dtype: object
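I'm trying to do several things here:
1) Replace misspelled endings (e.g. CMO) with the correct spelling (e.g. COM)
2) Replace misspelled domain names with the correct spelling
3) Replace multiple periods with a single period after the '@' symbol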
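4) Remove all periods before the '@' symbol if the address is a gmail account
5) Remove all characters after the '+' symbol, up to the '@' symbol
So, from the example above, I would get back: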
>>> email['EMAIL']
0 testing@.com
1 NaN
2 IamME@GMAIL.COM
3 FIRST.LAST.NAME@MAIL.COM
4 EMAIL@TESTING.COM
Name: EMAIL, dtype: object
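I've written many different versions of the code and keep running into errors. Here is one of my best guesses so far, which tries to remove multiple periods after the '@' symbol: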
def remove_periods(email):
    email_split = email['EMAIL'].str.split('@')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.')
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '@'.join(emailupdate)

email['EMAIL'].apply(remove_periods)
I could post several other versions, but they all return errors as well.
Thank you very much for your help.
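A note on why the attempt above fails: `Series.apply` calls the function once per element, so inside `remove_periods` the argument is a single string (or NaN), not the DataFrame, and `email['EMAIL']` raises a TypeError. The following is a minimal per-element sketch of step 3 only. The parameter name `address` and the sample data are illustrative assumptions, not part of the original post.

```python
import re

import numpy as np
import pandas as pd


def remove_periods(address):
    """Collapse runs of periods after the last '@' in a single address."""
    # Series.apply passes one element at a time; missing values arrive as NaN,
    # so anything that is not a string is returned unchanged.
    if not isinstance(address, str):
        return address
    # rpartition splits at the LAST '@' into (before, '@', after).
    name, at, domain = address.rpartition('@')
    domain = re.sub(r'\.{2,}', '.', domain)
    return name + at + domain


# Illustrative data (an assumption for this sketch).
email = pd.DataFrame({'EMAIL': ['testing@...com', np.nan, 'a..b@bar...com']})
cleaned = email['EMAIL'].apply(remove_periods)
print(cleaned)
```

Note that a string with no '@' at all ends up entirely in the last part of `rpartition`, so its periods would be collapsed everywhere. The vectorized `.str` approach is usually preferable to `apply` on larger columns.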
import numpy as np
import pandas as pd
pd.options.display.width = 1000
email = pd.DataFrame({'EMAIL':[
    'testing@...com', np.nan, 'I.am.ME@GAMIL.COM', 'FIRST.LAST.NAME@MAIL.CMO',
    'EMAIL+REMOVE@TESTING.COM', 'gamil@bar...com', 'noperiods@localhost']})
# split each address at the last '@' into name, separator, and domain
email[['NAME', '@', 'ADDR']] = email['EMAIL'].str.rpartition('@')
# regex=True is passed explicitly below: since pandas 2.0, str.replace
# treats the pattern as a literal string by default
# 1) replace misspelled endings (e.g. CMO) with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '@' symbol.
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '@' sign if they have a gmail account
# na=False makes missing addresses count as "not gmail"
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$', na=False)
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace(r'[.]', '', regex=True)
# 5) remove all characters after the "+" symbol up to the '@' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)
# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['@'] + email['ADDR']
# clean up intermediate columns
# email = email.drop(columns=['NAME', '@', 'ADDR'])
print(email)
yields
                      EMAIL             NAME     @         ADDR                 NEW_EMAIL
0            testing@...com          testing     @         .com              testing@.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME@GAMIL.COM            IamME     @    GMAIL.COM           IamME@GMAIL.COM
3  FIRST.LAST.NAME@MAIL.CMO  FIRST.LAST.NAME     @     MAIL.COM  FIRST.LAST.NAME@MAIL.COM
4  EMAIL+REMOVE@TESTING.COM            EMAIL     @  TESTING.COM         EMAIL@TESTING.COM
5           gamil@bar...com            gamil     @      bar.com             gamil@bar.com
6       noperiods@localhost        noperiods     @    localhost       noperiods@localhost
The NAME column holds everything before the last @.
The ADDR column holds everything after the last @.
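To make the "last @" behaviour concrete, here is a small sketch of what `str.rpartition('@')` returns. The two sample strings are made up for illustration and do not come from the data above.

```python
import pandas as pd

# Illustrative inputs: one with two '@' signs, one with none.
s = pd.Series(['a@b@example.com', 'noatsign'])

# rpartition splits at the LAST occurrence of the separator and
# returns a three-column DataFrame: before, separator, after.
parts = s.str.rpartition('@')
parts.columns = ['NAME', '@', 'ADDR']
print(parts)
```

When the separator is missing, the whole string lands in the last column and the first two are empty strings, which is why a value without '@' would be treated as a domain by the cleaning steps.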
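I left the NAME and ADDR columns visible (and did not overwrite the original EMAIL column) so that the intermediate steps are easier to follow.

Wow! Thank you so much. I'm amazed at how elegantly you did this. I was still writing so much tedious code trying to do the same thing!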
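If the intermediate columns are not wanted in the DataFrame, the same five steps could be wrapped in a function that works on a Series and returns only the cleaned result. This is a sketch under that assumption; the function name `clean_emails` is hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd


def clean_emails(emails):
    """Return a cleaned copy of a Series of addresses (hypothetical helper)."""
    # Split at the last '@' into name, separator, and domain.
    parts = emails.str.rpartition('@')
    name, at, addr = parts[0], parts[1], parts[2]
    # 1) and 2): fix a misspelled ending and a misspelled domain name.
    addr = addr.str.replace(r'(?i)CMO$', 'COM', regex=True)
    addr = addr.str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
    # 3) collapse repeated periods in the domain.
    addr = addr.str.replace(r'[.]{2,}', '.', regex=True)
    # 4) drop periods from the name for gmail accounts only.
    is_gmail = addr.str.contains(r'(?i)^GMAIL[.]COM$', na=False)
    name = name.where(~is_gmail, name.str.replace('.', '', regex=False))
    # 5) drop everything from '+' to the end of the name.
    name = name.str.replace(r'[+].*', '', regex=True)
    return name + at + addr


emails = pd.Series(['testing@...com', np.nan, 'I.am.ME@GAMIL.COM',
                    'FIRST.LAST.NAME@MAIL.CMO', 'EMAIL+REMOVE@TESTING.COM'])
cleaned = clean_emails(emails)
print(cleaned)
```

The result can then be assigned wherever it is needed, for example `email['NEW_EMAIL'] = clean_emails(email['EMAIL'])`, without creating the NAME, @, and ADDR columns.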