Python: flag rolling duplicates by the previous value (year) in a grouping variable

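I am trying to figure out whether any ID occurred in an earlier year (the Duplicate column in dfo). If so, I want to flag that row as a duplicate and include the year in which the ID first occurred (the Year_Duplicate column).

I do have working code.

Goal: I would like to learn a better (more "pythonic") way to solve this. By better I mean more concise. I am not very familiar with numpy and pandas, so any help is much appreciated.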

Sample input:

dfi.to_dict() = 
{'Year': {0: 2020,
  1: 2020,
  2: 2020,
  3: 2021,
  4: 2021,
  5: 2021,
  6: 2022,
  7: 2022,
  8: 2022},
 'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
 '$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}
Sample output:

dfo.to_dict()
{'Year': {0: 2020,
  1: 2020,
  2: 2020,
  3: 2021,
  4: 2021,
  5: 2021,
  6: 2022,
  7: 2022,
  8: 2022},
 'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
 '$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
 'Duplicate': {0: False,
  1: False,
  2: False,
  3: True,
  4: False,
  5: True,
  6: False,
  7: True,
  8: True},
 'Year_Duplicate': {0: nan,
  1: nan,
  2: nan,
  3: 2020.0,
  4: nan,
  5: 2020.0,
  6: nan,
  7: 2020.0,
  8: 2021.0}}
Working code:

import pandas as pd
from numpy import nan as NA

# dfi and dfo start out as the dicts shown above
dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)

df_process = dfi.copy()
# True for every occurrence of an ID after its first one
df_process['Duplicate'] = df_process['ID'].duplicated()

# Row label of the earliest Year for each ID (idxmin has to be called)
indexes = df_process.groupby('ID')['Year'].idxmin()
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})

df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
# Blank out Year_Duplicate on rows that fall in the ID's first year
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA

dfo.equals(df_process)  # returns True
I am happy to answer any clarifying questions. Thank you for your help.


Clarifications in response to the comments below:

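  • $
    is just a number representing sales. It can be ignored.
  • Year_Duplicate
    shows the first year in which the ID occurred. If the row is not a duplicate, no Year_Duplicate value is needed, and in that case we leave it blank.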
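The rule described in these clarifications can be spelled out as a plain loop before reaching for pandas. This is only a sketch: the function name `flag_repeats` is hypothetical, and it assumes the rows are already ordered by year, as they are in the sample input.

```python
def flag_repeats(years, ids):
    """Return (duplicate, year_duplicate) for parallel lists of years and IDs.

    Assumes the rows are ordered by year, so the first time an ID is seen
    is also its earliest year.
    """
    first_year = {}  # ID -> year of its first occurrence
    duplicate, year_duplicate = [], []
    for year, id_ in zip(years, ids):
        if id_ in first_year:
            # ID was seen before: flag the row and report its first year.
            duplicate.append(True)
            year_duplicate.append(first_year[id_])
        else:
            # First occurrence: remember the year, leave Year_Duplicate blank.
            first_year[id_] = year
            duplicate.append(False)
            year_duplicate.append(None)
    return duplicate, year_duplicate


years = [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022]
ids = [1, 2, 3, 1, 4, 2, 5, 1, 4]
duplicate, year_duplicate = flag_repeats(years, ids)
print(duplicate)
print(year_duplicate)
```

The pandas answers below express the same idea with vectorised operations instead of an explicit dictionary.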
You can use groupby().cumcount():

df['Duplicated'] = df.groupby('ID')['Year'].cumcount().gt(0)
df['Year_Duplicated'] = df['Year'].where(df['Duplicated'])
Output:

    Year  ID  $  Duplicated  Year_Duplicated
0  2020   1  1       False              NaN
1  2020   2  1       False              NaN
2  2020   3  1       False              NaN
3  2021   1  2        True           2021.0
4  2021   4  2       False              NaN
5  2021   2  2        True           2021.0
6  2022   5  3       False              NaN
7  2022   1  3        True           2022.0
8  2022   4  3        True           2022.0
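A quick check on the sample data, as a sketch: the flag column agrees with dfo, but Year_Duplicated carries each row's own Year rather than the year in which the ID first occurred. The expected values below are copied from dfo in the question; the variable names are only illustrative.

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "ID": [1, 2, 3, 1, 4, 2, 5, 1, 4],
    "$": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# The two lines from the answer above.
df["Duplicated"] = df.groupby("ID")["Year"].cumcount().gt(0)
df["Year_Duplicated"] = df["Year"].where(df["Duplicated"])

# Expected columns, copied from dfo in the question.
expected_flag = [False, False, False, True, False, True, False, True, True]
expected_year = pd.Series(
    [nan, nan, nan, 2020.0, nan, 2020.0, nan, 2020.0, 2021.0],
    name="Year_Duplicated",
)

flag_matches = df["Duplicated"].tolist() == expected_flag
year_matches = df["Year_Duplicated"].equals(expected_year)
print(flag_matches, year_matches)
```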
Use groupby().transform('first'):

Details:

print (df.groupby('ID')['Year'].transform('first'))
0    2020
1    2020
2    2020
3    2020
4    2021
5    2020
6    2022
7    2020
8    2021
Name: Year, dtype: int64
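A minimal sketch of how the transformed Series printed above might be turned into the two requested columns. Combining it with `Series.duplicated` and `Series.where` is an assumption made for illustration, not necessarily the code this answer originally used.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "ID": [1, 2, 3, 1, 4, 2, 5, 1, 4],
    "$": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# Year of the first row of each ID, broadcast back to every row.
first_year = df.groupby("ID")["Year"].transform("first")

# Flag every occurrence of an ID after its first one.
df["Duplicate"] = df["ID"].duplicated()

# Keep the first year only on flagged rows; other rows become NaN.
df["Year_Duplicate"] = first_year.where(df["Duplicate"])

print(df)
```

Because `transform` returns a Series aligned with the original index, no merge is needed.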

This produces the Year_Duplicate column the way it is shown in the dfo DataFrame:

dfi['Duplicate'] = dfi.duplicated(subset='ID', keep='first')
first_year = dfi.groupby('ID')['Year'].first()
dfi['Year_Duplicate'] = dfi.loc[dfi['Duplicate'], 'ID'].map(first_year)
Output:

   Year  ID  $  Duplicate  Year_Duplicate
0  2020   1  1      False             NaN
1  2020   2  1      False             NaN
2  2020   3  1      False             NaN
3  2021   1  2       True          2020.0
4  2021   4  2      False             NaN
5  2021   2  2       True          2020.0
6  2022   5  3      False             NaN
7  2022   1  3       True          2020.0
8  2022   4  3       True          2021.0

dfo.equals(dfi) #True
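Both `duplicated(keep='first')` and `groupby(...).first()` depend on row order, so the snippet above relies on the rows being sorted by Year. If that cannot be guaranteed, one possible variant is to compare each row with the minimum year of its ID. This is a sketch under an assumption: the helper name `flag_repeats` is hypothetical, and it flags a row only when the ID occurred in a strictly earlier year, so repeats within the same year are not flagged.

```python
import pandas as pd


def flag_repeats(df, id_col="ID", year_col="Year"):
    """Flag rows whose ID already occurred in a strictly earlier year."""
    out = df.copy()
    # Earliest year of each ID, aligned with the original rows.
    earliest = out.groupby(id_col)[year_col].transform("min")
    out["Duplicate"] = out[year_col] > earliest
    # Report the earliest year only on flagged rows; NaN elsewhere.
    out["Year_Duplicate"] = earliest.where(out["Duplicate"])
    return out


sample = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "ID": [1, 2, 3, 1, 4, 2, 5, 1, 4],
    "$": [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

# Reverse the row order to show that the result does not depend on sorting.
result = flag_repeats(sample.iloc[::-1]).sort_index()
print(result)
```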

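What is the purpose of "$"? Can we omit it?
"$" just represents sales. We can ignore it.
Can you add more data to the sample? I am not 100% sure whether my answer meets the requirements.
In dfo, the Year_Duplicate column should show the earliest year in which the ID occurred, not just a copy of the Year column, shouldn't it?
The currently accepted answer is incorrect. Please look closely at the values in the Year_Duplicate column. That is why I explicitly asked what that column means.
The answer here does not seem to match dfo. Can you correct your answer?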