Python: flag rolling duplicates by the previous value (year) within a grouping variable
I am trying to figure out whether any ID occurred in an earlier year (the Duplicate column in dfo). If it did, I want to flag that row as a duplicate and include the year in which the ID first appeared (the Year_Duplicate column).
I do have working code.
Goal: I would like to learn a better (more "pythonic") way to solve this problem.
By "better" I mean more concise. If there is a more concise way to solve this, I would greatly appreciate your help. I am not very familiar with numpy and pandas.
Sample input:
dfi.to_dict() =
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3}}
Sample output:
dfo.to_dict()
{'Year': {0: 2020,
1: 2020,
2: 2020,
3: 2021,
4: 2021,
5: 2021,
6: 2022,
7: 2022,
8: 2022},
'ID': {0: 1, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5, 7: 1, 8: 4},
'$': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3},
'Duplicate': {0: False,
1: False,
2: False,
3: True,
4: False,
5: True,
6: False,
7: True,
8: True},
'Year_Duplicate': {0: nan,
1: nan,
2: nan,
3: 2020.0,
4: nan,
5: 2020.0,
6: nan,
7: 2020.0,
8: 2021.0}}
Working code:
import pandas as pd
from numpy import nan as NA

# dfi and dfo start out as the dicts shown above
dfi = pd.DataFrame.from_dict(dfi)
dfo = pd.DataFrame.from_dict(dfo)

df_process = dfi.copy()
# Flag every occurrence of an ID after its first one
df_process['Duplicate'] = df_process['ID'].duplicated()
# Row label of the earliest Year for each ID (idxmin has to be called)
indexes = df_process.groupby('ID')['Year'].idxmin()
df_min_year = df_process[['Year', 'ID']].loc[indexes]
df_min_year = df_min_year.rename(columns={"Year": "Year_Duplicate"})
df_process = pd.merge(df_process, df_min_year, on=['ID'], how='left')
# Blank out Year_Duplicate on the row where the ID first appears
df_process.loc[df_process['Year_Duplicate'] == df_process['Year'], 'Year_Duplicate'] = NA
dfo.equals(df_process)  # returns True
I am happy to clarify anything. Thank you for helping me.
Clarifications in response to the comments below:
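$ is just a number representing sales. It can be ignored.
Year_Duplicate shows the first year in which that ID occurred. If the row is not a duplicate, there is nothing to fill in for Year_Duplicate, so in that case we leave it blank.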
Try groupby().cumcount:
df['Duplicated'] = df.groupby('ID')['Year'].cumcount().gt(0)
df['Year_Duplicated'] = df['Year'].where(df['Duplicated'])
Output:
Year ID $ Duplicated Year_Duplicated
0 2020 1 1 False NaN
1 2020 2 1 False NaN
2 2020 3 1 False NaN
3 2021 1 2 True 2021.0
4 2021 4 2 False NaN
5 2021 2 2 True 2021.0
6 2022 5 3 False NaN
7 2022 1 3 True 2022.0
8 2022 4 3 True 2022.0
Use ... with ... and ...:
Details:
print (df.groupby('ID')['Year'].transform('first'))
0 2020
1 2020
2 2020
3 2020
4 2021
5 2020
6 2022
7 2020
8 2021
Name: Year, dtype: int64
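The code that goes with this answer is not shown here. A minimal sketch of how the transform('first') result above could be turned into the two columns, assuming repeats are flagged with Series.duplicated and the first-year values are masked with Series.where (both choices are assumptions, not confirmed by the answer; the name out is also just illustrative):

```python
import pandas as pd

# Sample input from the question, rebuilt here so the sketch runs on its own
dfi = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022, 2022, 2022],
    "ID":   [1, 2, 3, 1, 4, 2, 5, 1, 4],
    "$":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
})

out = dfi.copy()

# True for every occurrence of an ID after its first row (assumed flagging method)
out["Duplicate"] = out["ID"].duplicated()

# Year of the first row of each ID, broadcast back to every row of that ID
first_year = out.groupby("ID")["Year"].transform("first")

# Keep the first year only on flagged rows; all other rows become NaN
out["Year_Duplicate"] = first_year.where(out["Duplicate"])

print(out)
```

Because Series.where fills the unflagged rows with NaN, Year_Duplicate ends up as a float column, which is the same shape the question's dfo uses.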
This produces the Year_Duplicate column in the way shown in the data frame dfo:
dfi['Duplicate'] = dfi.duplicated(subset='ID', keep='first')
first_year = dfi.groupby('ID')['Year'].first()
dfi['Year_Duplicate'] = dfi.loc[dfi['Duplicate'], 'ID'].map(first_year)
Output:
Year ID $ Duplicate Year_Duplicate
0 2020 1 1 False NaN
1 2020 2 1 False NaN
2 2020 3 1 False NaN
3 2021 1 2 True 2020.0
4 2021 4 2 False NaN
5 2021 2 2 True 2020.0
6 2022 5 3 False NaN
7 2022 1 3 True 2020.0
8 2022 4 3 True 2021.0
dfo.equals(dfi) #True
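Both duplicated(keep='first') and groupby(...).first() depend on row order, so they only give the earliest year when the rows are already sorted by Year. A hedged sketch for unsorted input, assuming each ID appears at most once per year (the shuffled frame and the names df and earliest are illustrative, not from the question):

```python
import pandas as pd

# The question's Year/ID pairs in a deliberately shuffled order (illustrative data)
df = pd.DataFrame({
    "Year": [2022, 2021, 2020, 2022, 2020, 2021, 2020, 2021, 2022],
    "ID":   [1, 1, 1, 4, 2, 2, 3, 4, 5],
})

# Earliest year per ID, independent of row order
earliest = df.groupby("ID")["Year"].transform("min")

# A row is a repeat when its year is later than the ID's earliest year
# (assumes an ID never appears twice within the same year)
df["Duplicate"] = df["Year"].gt(earliest)

# Earliest year on repeat rows, NaN elsewhere
df["Year_Duplicate"] = earliest.where(df["Duplicate"])

print(df)
```

If an ID can occur several times in the same year, this comparison would leave all of those same-year rows unflagged, so a positional method such as duplicated() on sorted data may fit better in that case.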
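What is the role of "$"? Can we omit it?
"$" just represents sales. We can ignore it.
Can you add more data to the sample? I am not 100% sure whether my answer meets the requirements of dfo.
The Year_Duplicate column should show the earliest year in which the ID occurred, not just a copy of the Year column, shouldn't it?
The currently accepted answer is incorrect. Please look closely at the values in the Year_Duplicate column. That is why I explicitly asked what that column means.
The answer here does not seem to match dfo. Could you correct your answer?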