Retrieve the last unique value in a Python column with duplicate values


python pandas dataframe for-loop


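I have a dataset containing email addresses, dates, and job functions. I want to get the current job function and the previous job function (the value that differs from the current job function and was held immediately before it). For example, john.k@abc.com is currently swe_mgr2, and his previous job function was swe_mgr1. I would also like to know how long the previous job function lasted. The duration is tricky to compute because start dates are recorded at irregular intervals, but it can be measured from the first row of the previous job function's start date to the first row of the current job function's start date. For example, john.k@abc.com held swe_mgr1 from 30-08-2018 to 01-06-2019 (i.e. 10 months).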

Dataset

email              startdate       jobfunction
john.k@abc.com     01-01-2018      swe_ic1
john.k@abc.com     01-03-2018      swe_ic2
john.k@abc.com     30-08-2018      swe_mgr1
john.k@abc.com     01-06-2019      swe_mgr2  
john.k@abc.com     01-06-2020      swe_mgr2
greg.h@abc.com     30-01-2018      mkt_ic2
greg.h@abc.com     01-06-2018      mkt_ic3 
greg.h@abc.com     07-09-2018      mkt_mgr1
greg.h@abc.com     12-12-2018      mkt_mgr2
greg.h@abc.com     15-01-2019      mkt_mgr2 
greg.h@abc.com     05-06-2019      mkt_mgr2
greg.h@abc.com     01-06-2020      mkt_mgr3
joseph.c@abc.com   01-06-2019      sales_ic1
joseph.c@abc.com   01-06-2020      sales_mgr1

The expected output is

email             current_function     previous_function      duration_previous_function
john.k@abc.com    swe_mgr2             swe_mgr1                10mths
greg.h@abc.com    mkt_mgr3             mkt_mgr2                18mths
joseph.c@abc.com  sales_mgr1           sales_ic1               12mths

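I am stuck on the first step, which is getting the previous job function.

This code retrieves the current job function rather than the previous one: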

df2 = df.groupby('email').last().sort_index().reset_index().drop_duplicates()

I also wondered whether this could be done by looping over each email address, but the code below does not work:

emails = df['email']
assigndate = df['startdate']
jobname = df['jobfunction']

for i in emails:
    prevjob = jobname.apply(lambda x: x.unique([-2]))

Any help is appreciated.

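You can first convert the column to datetimes with to_datetime, then sort by both columns with DataFrame.sort_values, keep the last duplicates with DataFrame.drop_duplicates, create a counter with GroupBy.cumcount to filter the first 2 rows per email and to serve as the new level in the MultiIndex, and reshape with DataFrame.set_index and DataFrame.unstack: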
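The first half of these steps can be sketched on a few rows copied from the question. This is only an illustration: the names firsts and top2 are made up here, and it assumes nobody returns to a job function they held earlier.

```python
import pandas as pd

# Rows copied from the question's dataset (john.k and joseph.c only).
df = pd.DataFrame({
    "email": ["john.k@abc.com"] * 5 + ["joseph.c@abc.com"] * 2,
    "startdate": ["01-01-2018", "01-03-2018", "30-08-2018", "01-06-2019",
                  "01-06-2020", "01-06-2019", "01-06-2020"],
    "jobfunction": ["swe_ic1", "swe_ic2", "swe_mgr1", "swe_mgr2",
                    "swe_mgr2", "sales_ic1", "sales_mgr1"],
})
df["startdate"] = pd.to_datetime(df["startdate"], dayfirst=True)

# Newest rows first within each email; keep='last' then retains the
# earliest start date of every (email, jobfunction) pair.
firsts = (df.sort_values(["email", "startdate"], ascending=[True, False])
            .drop_duplicates(["email", "jobfunction"], keep="last"))

# Counter per email: 0 = current function, 1 = previous function, and so on.
firsts = firsts.assign(g=firsts.groupby("email").cumcount())

# Only the current and previous function are needed.
top2 = firsts[firsts["g"].lt(2)]
print(top2)
```

For john.k@abc.com the row with g equal to 0 is swe_mgr2 with its first start date (01-06-2019), not the later 01-06-2020 row, which is what makes the duration calculation possible afterwards.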

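Finally, remove the two start date columns with DataFrame.pop, convert them to month periods with Series.dt.to_period, and subtract them: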
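Subtracting month periods counts calendar months and ignores the day of the month. The same arithmetic can be written out with year and month fields, as in this small sketch (months_between is a hypothetical helper name, not part of pandas):

```python
import pandas as pd

def months_between(start: pd.Timestamp, end: pd.Timestamp) -> int:
    """Calendar months from start's month to end's month (days are ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

# john.k@abc.com: swe_mgr1 started 30-08-2018, swe_mgr2 started 01-06-2019.
print(months_between(pd.Timestamp("2018-08-30"), pd.Timestamp("2019-06-01")))  # 10

# The same calculation on whole columns (john.k and greg.h from the question).
s_prev = pd.Series(pd.to_datetime(["2018-08-30", "2018-12-12"]))
s_curr = pd.Series(pd.to_datetime(["2019-06-01", "2020-06-01"]))
months = (s_curr.dt.year - s_prev.dt.year) * 12 + (s_curr.dt.month - s_prev.dt.month)
print(months.tolist())  # [10, 18]
```

Both results match the 10mths and 18mths in the expected output.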

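You're fast! This seems to work, but the current job function in the result appears to be the previous one, and vice versa. What if we just rename the columns with rename(columns={0:'current', 1:'previous'})?

@wjie08 - Thanks, fixed in the last edit: I removed .sort_index(level=[1], ascending=False, axis=1) so the columns in the final df come out in the correct order.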

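import pandas as pd

# Parse day-first strings such as 30-08-2018 into datetimes.
df['startdate'] = pd.to_datetime(df['startdate'], dayfirst=True)

# Sort newest first within each email, then keep the earliest row
# of every (email, jobfunction) pair.
df = (df.sort_values(['email','startdate'], ascending=[True, False])
        .drop_duplicates(['email','jobfunction'], keep='last'))
# Counter per email: 0 = current function, 1 = previous function.
df['g'] = df.groupby('email').cumcount()
df1 = df[df['g'].lt(2)].copy()
# Reshape so each email has one row with current/previous columns.
df1 = (df1.set_index(['email','g'])
          .unstack()
          .rename(columns={0:'current',1:'previous'}))
df1.columns = [f'{b}_{a}' for a,b in df1.columns]
df1 = df1.reset_index()

# Difference of monthly period ordinals = months spent in the previous function.
df1['duration_previous_function'] = (df1.pop('current_startdate')
                                        .dt.to_period('M')
                                        .astype('int64')
                                        .sub(df1.pop('previous_startdate')
                                                .dt.to_period('M')
                                                .astype('int64')))
print(df1)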
              email current_jobfunction previous_jobfunction  \
0    greg.h@abc.com            mkt_mgr3             mkt_mgr2   
1    john.k@abc.com            swe_mgr2             swe_mgr1   
2  joseph.c@abc.com          sales_mgr1            sales_ic1   

   duration_previous_function  
0                          18  
1                          10  
2                          12
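
The question also asks whether a loop over each email address could work. It can, although it is usually slower than the vectorised version above. A minimal sketch follows, assuming startdate is already a datetime column and that nobody returns to a job function held earlier; previous_function_summary is a hypothetical name chosen for this example.

```python
import pandas as pd

def previous_function_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per email: current function, previous function, months in previous."""
    rows = []
    for email, grp in df.sort_values("startdate").groupby("email", sort=False):
        # First start date of each job function, in chronological order.
        firsts = grp.drop_duplicates("jobfunction", keep="first")
        if len(firsts) < 2:
            continue  # no previous function recorded for this email
        prev, curr = firsts.iloc[-2], firsts.iloc[-1]
        months = ((curr["startdate"].year - prev["startdate"].year) * 12
                  + curr["startdate"].month - prev["startdate"].month)
        rows.append({"email": email,
                     "current_function": curr["jobfunction"],
                     "previous_function": prev["jobfunction"],
                     "duration_previous_function": months})
    return pd.DataFrame(rows)

# Rows copied from the question's dataset (greg.h and joseph.c only).
df = pd.DataFrame({
    "email": ["greg.h@abc.com"] * 7 + ["joseph.c@abc.com"] * 2,
    "startdate": pd.to_datetime(
        ["30-01-2018", "01-06-2018", "07-09-2018", "12-12-2018", "15-01-2019",
         "05-06-2019", "01-06-2020", "01-06-2019", "01-06-2020"], dayfirst=True),
    "jobfunction": ["mkt_ic2", "mkt_ic3", "mkt_mgr1", "mkt_mgr2", "mkt_mgr2",
                    "mkt_mgr2", "mkt_mgr3", "sales_ic1", "sales_mgr1"],
})
result = previous_function_summary(df)
print(result)
```

Emails with only one job function are skipped here; whether they should appear with empty previous values instead is a design choice the question does not settle.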