Python: create a "status" column based on the presence of distinct elements in another column, for each ("user_id", "date") pair
I just want to create the "status" column shown in the table below. This column assigns the label "Complete" or "Incomplete" depending on whether each ("user_id", "date") pair has all 4 elements ("a", "b", "c", "d") in the "question" column.
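You can use GroupBy.transform with set to drop duplicates within each group, then use len to count the resulting unique elements. This lets us check whether each ("user_id", "date") pair has 4 elements ("a", "b", "c", "d") in the "question" column.
Try not dropping the temp column to see clearly what it does: it shows how many distinct questions each user_id answered on each date.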
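To make the role of the intermediate count concrete, here is a minimal sketch on a small made-up frame. It assumes the built-in `'nunique'` aggregation as a stand-in for `len(set(x))`; the column name `n_questions` and the toy data are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one complete group and two incomplete ones
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 1, 1, 2],
    'date':     ['01-01-2020'] * 4 + ['02-01-2020'] * 2 + ['01-01-2020'],
    'question': ['a', 'b', 'c', 'd', 'a', 'a', 'a'],
})

# Number of distinct questions per (user_id, date), broadcast to every row
toy['n_questions'] = toy.groupby(['user_id', 'date'])['question'].transform('nunique')

# Groups with fewer than 4 distinct questions are labelled Incomplete
toy['status_new'] = np.where(toy['n_questions'] < 4, 'Incomplete', 'Complete')
print(toy)
```

Keeping `n_questions` in the frame makes it easy to see why a given row was labelled the way it was; repeated answers such as the two "a" rows count only once.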
import pandas as pd

# Sample data from the question; 'status' holds the expected result
df = pd.DataFrame({
    'user_id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4],
    'date': ["01-01-2020","01-01-2020","01-01-2020","01-01-2020","02-01-2020","02-01-2020","03-01-2020","04-01-2020",
             "04-01-2020","04-01-2020","05-01-2020","05-01-2020","06-01-2020","06-01-2020","07-01-2020","08-01-2020",
             "08-01-2020","09-01-2020","09-01-2020","09-01-2020","09-01-2020","10-01-2020","10-01-2020","11-01-2020",
             "11-01-2020","12-01-2020","12-01-2020"],
    'question': ["a","b","c","d","a","b","a","a","b","c","a","b","a","b","a","a","b","a","b","c","d","a","b","a","b","a","b"],
    'status': ["Complete","Complete","Complete","Complete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete",
               "Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Complete","Complete","Complete",
               "Complete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete","Incomplete"],
})
# The post called display_full(df), a helper that is not shown; to_markdown (needs the tabulate package) prints a similar table
print(df.to_markdown())
| | user_id | date | question | status |
|---:|----------:|:-----------|:-----------|:-----------|
| 0 | 1 | 01-01-2020 | a | Complete |
| 1 | 1 | 01-01-2020 | b | Complete |
| 2 | 1 | 01-01-2020 | c | Complete |
| 3 | 1 | 01-01-2020 | d | Complete |
| 4 | 1 | 02-01-2020 | a | Incomplete |
| 5 | 1 | 02-01-2020 | b | Incomplete |
| 6 | 1 | 03-01-2020 | a | Incomplete |
| 7 | 2 | 04-01-2020 | a | Incomplete |
| 8 | 2 | 04-01-2020 | b | Incomplete |
| 9 | 2 | 04-01-2020 | c | Incomplete |
| 10 | 2 | 05-01-2020 | a | Incomplete |
| 11 | 2 | 05-01-2020 | b | Incomplete |
| 12 | 2 | 06-01-2020 | a | Incomplete |
| 13 | 2 | 06-01-2020 | b | Incomplete |
| 14 | 3 | 07-01-2020 | a | Incomplete |
| 15 | 3 | 08-01-2020 | a | Incomplete |
| 16 | 3 | 08-01-2020 | b | Incomplete |
| 17 | 3 | 09-01-2020 | a | Complete |
| 18 | 3 | 09-01-2020 | b | Complete |
| 19 | 3 | 09-01-2020 | c | Complete |
| 20 | 3 | 09-01-2020 | d | Complete |
| 21 | 4 | 10-01-2020 | a | Incomplete |
| 22 | 4 | 10-01-2020 | b | Incomplete |
| 23 | 4 | 11-01-2020 | a | Incomplete |
| 24 | 4 | 11-01-2020 | b | Incomplete |
| 25 | 4 | 12-01-2020 | a | Incomplete |
| 26 | 4 | 12-01-2020 | b | Incomplete |
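import numpy as np
import pandas as pd

# Count the distinct questions within each (user_id, date) group
df['temp'] = df.groupby(['user_id', 'date'])['question'].transform(lambda x: len(set(x)))
# Fewer than 4 distinct questions means the group is incomplete
df['status_new'] = np.where(df['temp'] < 4, 'Incomplete', 'Complete')
# Drop the helper column (skip this line to inspect the counts)
df.drop('temp', axis=1, inplace=True)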
| | user_id | date | question | status | status_new |
|---:|----------:|:-----------|:-----------|:-----------|:-----------|
| 0 | 1 | 01-01-2020 | a | Complete | Complete |
| 1 | 1 | 01-01-2020 | b | Complete | Complete |
| 2 | 1 | 01-01-2020 | c | Complete | Complete |
| 3 | 1 | 01-01-2020 | d | Complete | Complete |
| 4 | 1 | 02-01-2020 | a | Incomplete | Incomplete |
| 5 | 1 | 02-01-2020 | b | Incomplete | Incomplete |
| 6 | 1 | 03-01-2020 | a | Incomplete | Incomplete |
| 7 | 2 | 04-01-2020 | a | Incomplete | Incomplete |
| 8 | 2 | 04-01-2020 | b | Incomplete | Incomplete |
| 9 | 2 | 04-01-2020 | c | Incomplete | Incomplete |
| 10 | 2 | 05-01-2020 | a | Incomplete | Incomplete |
| 11 | 2 | 05-01-2020 | b | Incomplete | Incomplete |
| 12 | 2 | 06-01-2020 | a | Incomplete | Incomplete |
| 13 | 2 | 06-01-2020 | b | Incomplete | Incomplete |
| 14 | 3 | 07-01-2020 | a | Incomplete | Incomplete |
| 15 | 3 | 08-01-2020 | a | Incomplete | Incomplete |
| 16 | 3 | 08-01-2020 | b | Incomplete | Incomplete |
| 17 | 3 | 09-01-2020 | a | Complete | Complete |
| 18 | 3 | 09-01-2020 | b | Complete | Complete |
| 19 | 3 | 09-01-2020 | c | Complete | Complete |
| 20 | 3 | 09-01-2020 | d | Complete | Complete |
| 21 | 4 | 10-01-2020 | a | Incomplete | Incomplete |
| 22 | 4 | 10-01-2020 | b | Incomplete | Incomplete |
| 23 | 4 | 11-01-2020 | a | Incomplete | Incomplete |
| 24 | 4 | 11-01-2020 | b | Incomplete | Incomplete |
| 25 | 4 | 12-01-2020 | a | Incomplete | Incomplete |
| 26 | 4 | 12-01-2020 | b | Incomplete | Incomplete |
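Note that a count of 4 distinct values does not by itself guarantee that those values are exactly "a" to "d". If that matters, one possible variation is to test set membership directly. The sketch below reflects an assumption about the intent rather than the original answer; the name `required` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical name for the set of questions a complete group must contain
required = {'a', 'b', 'c', 'd'}

# Toy data: user 2 has 4 distinct answers, but "e" replaces "d"
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 2, 2, 2, 2],
    'date':     ['01-01-2020'] * 8,
    'question': ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'e'],
})

# Label each row by whether its (user_id, date) group covers every required
# question; the scalar returned by the lambda is broadcast to the whole group
toy['status_new'] = (
    toy.groupby(['user_id', 'date'])['question']
       .transform(lambda s: 'Complete' if required.issubset(s) else 'Incomplete')
)
print(toy)
```

With a plain distinct count, user 2 would be labelled Complete; the subset test labels that group Incomplete because "d" is missing.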