Replacing a substring after a character in a pandas DataFrame


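I'm new to pandas and have been having a lot of trouble with this. Despite searching, I haven't found a solution yet. Hopefully one of you can help me.

I have a pandas DataFrame with a column of email addresses that I'm trying to clean up. For example: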

>>> email['EMAIL']
0              testing@...com
1                         NaN
2           I.am.ME@GAMIL.COM
3    FIRST.LAST.NAME@MAIL.CMO
4    EMAIL+REMOVE@TESTING.COM
Name: EMAIL, dtype: object
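I'm trying to do several things here:

1) Replace misspelled endings (e.g. CMO) with the correct spelling (e.g. COM)

2) Replace misspelled domain names with the correct spelling

3) Replace multiple periods after the '@' symbol with a single period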

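4) Remove all periods before the '@' symbol if the address is a Gmail account

5) Remove all characters after the '+' symbol, up to the '@' symbol

So, from the example above, I would get back: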

>>> email['EMAIL']
0                testing@.com
1                         NaN
2             IamME@GMAIL.COM
3    FIRST.LAST.NAME@MAIL.COM
4           EMAIL@TESTING.COM
Name: EMAIL, dtype: object
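I've written many different versions of the code and keep running into errors. Here is one of my best guesses so far, which tries to remove multiple periods after the '@' symbol: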

def remove_periods(email):
    email_split = email['EMAIL'].str.split('@')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.') 
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '@'.join(emailupdate)
email['EMAIL'].apply(remove_periods)
I could post several other versions as well, but they all raise errors too.

Thank you very much for your help.

import numpy as np
import pandas as pd

pd.options.display.width = 1000
email = pd.DataFrame({'EMAIL':[
    'testing@...com', np.nan, 'I.am.ME@GAMIL.COM', 'FIRST.LAST.NAME@MAIL.CMO', 
    'EMAIL+REMOVE@TESTING.COM', 'gamil@bar...com', 'noperiods@localhost']})

email[['NAME', '@', 'ADDR']] = email['EMAIL'].str.rpartition('@')

# Note: regex=True is passed explicitly because pandas >= 2.0 treats the
# pattern in str.replace as a literal string by default.

# 1) replace misspelled endings (e.g. CMO) with correct spellings 
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings 
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '@' symbol. 
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '@' sign if they have a gmail account 
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$', na=False)
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace('.', '', regex=False)
# 5) remove all characters after the "+" symbol up to the '@' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)

# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['@'] + email['ADDR']

# clean up intermediate columns
# email = email.drop(columns=['NAME', '@', 'ADDR'])
print(email)
yields

                      EMAIL             NAME     @         ADDR                 NEW_EMAIL
0            testing@...com          testing     @         .com              testing@.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME@GAMIL.COM            IamME     @    GMAIL.COM           IamME@GMAIL.COM
3  FIRST.LAST.NAME@MAIL.CMO  FIRST.LAST.NAME     @     MAIL.COM  FIRST.LAST.NAME@MAIL.COM
4  EMAIL+REMOVE@TESTING.COM            EMAIL     @  TESTING.COM         EMAIL@TESTING.COM
5           gamil@bar...com            gamil     @      bar.com             gamil@bar.com
6       noperiods@localhost        noperiods     @    localhost       noperiods@localhost
The NAME column holds everything before the last @. The ADDR column holds everything after the last @.
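To make the split concrete, here is a small sketch of how `rpartition` behaves, first on plain Python strings and then through the pandas `.str` accessor. The sample strings are made up for illustration and are not from the question. Note that an address with no '@' ends up entirely in the last slot, with empty strings in the first two.

```python
import numpy as np
import pandas as pd

# Plain Python: rpartition splits on the LAST occurrence of the separator.
last_split = 'a@b@example.com'.rpartition('@')
# With no separator present, everything lands in the final slot.
no_sep = 'no-at-sign'.rpartition('@')
print(last_split)
print(no_sep)

# pandas applies the same split element-wise and returns three columns;
# missing values stay missing in every column.
s = pd.Series(['a@b@example.com', 'no-at-sign', np.nan])
parts = s.str.rpartition('@')
parts.columns = ['NAME', '@', 'ADDR']  # illustrative labels, as in the answer
print(parts)
```

If addresses without an '@' can occur in your data, you may want to check for an empty '@' column before recombining.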

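I left the NAME and ADDR columns visible (and did not overwrite the original EMAIL column) so that the intermediate steps are easier to follow.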
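For comparison, the per-element style attempted in the question can also be made to work. `Series.apply` passes one element at a time, so the function receives a single string (or NaN) rather than the DataFrame, which is why indexing `email['EMAIL']` inside it fails. The following is only a sketch of that alternative, not part of the answer above: the name `clean_email` is hypothetical, and it reuses the same five rules with Python's `re` module.

```python
import re

import numpy as np
import pandas as pd


def clean_email(value):
    """Clean one address; non-string values such as NaN pass through unchanged."""
    if not isinstance(value, str):
        return value
    name, at, addr = value.rpartition('@')
    addr = re.sub(r'(?i)CMO$', 'COM', addr)      # 1) fix misspelled ending
    addr = re.sub(r'(?i)GAMIL', 'GMAIL', addr)   # 2) fix misspelled domain
    addr = re.sub(r'[.]{2,}', '.', addr)         # 3) collapse repeated periods
    if re.fullmatch(r'(?i)GMAIL[.]COM', addr):   # 4) gmail: drop periods in name
        name = name.replace('.', '')
    name = re.sub(r'[+].*', '', name)            # 5) drop '+' and what follows
    return name + at + addr


emails = pd.Series([
    'testing@...com', np.nan, 'I.am.ME@GAMIL.COM',
    'FIRST.LAST.NAME@MAIL.CMO', 'EMAIL+REMOVE@TESTING.COM'])
cleaned = emails.apply(clean_email)
print(cleaned)
```

The vectorized `.str` version is usually the better fit for large frames; the element-wise function is mainly easier to unit test one address at a time.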

Wow! Thank you so much. I'm amazed at how elegantly you did this. I'm still writing so much tedious code trying to do the same thing!