How do I find the last log entry before each event? (R)
This is my table:
user_id event timestamp
Rob business 111111
Rob progress 111112
Rob business 222222
Mike progress 111111
Mike progress 222222
Rob progress 000001
Mike business 333333
Mike progress 444444
Lee progress 111111
Lee progress 222222
Mike business 333334
The dput of the table:
dput(input)
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 1L, 2L),
.Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L),
.Label = c("business", "progress"), class = "factor"),
timestamp = c(111111,111112, 222222, 111111, 222222, 1, 333333, 444444, 111111, 222222, 333334)),
.Names = c("user_id", "event", "timestamp"), row.names = c(NA, -11L), class = "data.frame")
Thanks for your help.
df <-
structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L,
2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business",
"progress"), class = "factor"), timestamp = c(111111,111112, 222222,
111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id",
"event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")
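# I want to know the last progress event before every business event happens
new <- df[0, ]
for (i in 2:nrow(df)) {
  # Keep row i-1 when row i is "business" and the row right before it is "progress".
  # Note: this compares adjacent rows as stored and does not group by user_id.
  if (df$event[i] == "business" && df$event[i - 1] == "progress") {
    new <- rbind(new, df[i - 1, ])
  }
}
new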
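Note that the result contains only two rows: business appears only three times, and its first occurrence is on the first row, which has no preceding row.
As long as I have understood the question correctly, this looks like something the lag function from dplyr can solve.
Here is an example: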
library(dplyr)  # provides %>%, arrange, group_by, mutate, lag, filter, select, rename

# Set up the data structure
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L,
2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business",
"progress"), class = "factor"), timestamp = c(111111,111112, 222222,
111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id",
"event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")
# Perform the manipulation
df %>%
  arrange(user_id, timestamp) %>%              # Sort by user and timestamp
  group_by(user_id) %>%                        # Group/partition by each user
  mutate(last_event = lag(event, 1),           # Find the last event
         last_timestamp = lag(timestamp, 1)) %>%  # And the time it occurred
  filter(event == "business") %>%              # Keep only the business events, since those are what we're interested in
  select(user_id, last_event, last_timestamp) %>%  # Select the fields of interest
  rename(event = last_event,                   # Tidy up the field names
         timestamp = last_timestamp)
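On this dataset the output will be the same, but this (filtering to business and progress events before taking the lag) may be a necessary step if other event types come in.
How is the second row included in the expected output?
@RonakShah Sorry, that was a mistake.
@Hack-R Yes, sorry.
@Smasell No worries, my answer has the correct result. Let me know if you have any questions.
@Hack-R I mean per user_id. Mike has a business event at timestamp 333333 and a progress event at timestamp 222222. But where is Mike?
@Smasell Mike should not be in the result. Mike never has progress before business. Double-check your example, and if you are still confused, give me a row number. Mike has progress, but never progress immediately before business.
Sorry! Use the timestamp to determine the order for each user_id.
@Smasell No problem. I'm not sure I understand your comment. Do you see why this result is correct now? It captures the last progress event row before each business event row, as you asked.
@Hack-R The OP has updated the question again. I'm not sure what conditions he wants for the result. Either way, I have deleted my answer for now.
Thanks, it works, but the problem I'm facing is that if I have two consecutive business events, I get business events in my result! How do I fix that?
You can use similar logic with lead and/or lag together with the row_number() function to identify business events that happen one after another. Then you can drop one or more of the consecutive business events. Alternatively, you can extend the lag logic to simply look for the most recent progress event before any business event (even consecutive ones). It depends on whether you are interested in the progress event immediately before a business event, or just the latest progress event before each business event.
I'm interested in the latest progress event before each business event.
If you mock up some sample data, I can come up with a simple join approach.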
Result of the loop:
  user_id    event timestamp
2     Rob progress    111112
6     Rob progress         1
Result of the dplyr pipeline:
  user_id    event timestamp
   <fctr>   <fctr>     <dbl>
1    Mike progress    222222
2     Rob progress         1
3     Rob progress    111112
# Variant: keep only business and progress events before taking the lag,
# in case other event types are present (requires dplyr to be loaded)
df %>%
  filter(event %in% c("business", "progress")) %>%
  arrange(user_id, timestamp) %>%
  group_by(user_id) %>%
  mutate(last_event = lag(event, 1),
         last_timestamp = lag(timestamp, 1)) %>%
  filter(event == "business") %>%
  select(user_id, last_event, last_timestamp) %>%
  rename(event = last_event,
         timestamp = last_timestamp)
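To see why filtering on event type before taking the lag can matter, here is a small base R sketch. The data, the third event type "login", and the helpers `lag_by_user` and `previous_event` are all hypothetical; `lag_by_user` is only a base R stand-in for a grouped `lag`, not dplyr's implementation.

```r
# Hypothetical data: a "login" event sits between progress and business
events <- data.frame(
  user_id   = c("Ann", "Ann", "Ann"),
  event     = c("progress", "login", "business"),
  timestamp = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift values down by one row within each user
lag_by_user <- function(x, g) {
  ave(x, g, FUN = function(v) c(NA, head(v, -1)))
}

# Hypothetical helper: previous event for every business row, per user
previous_event <- function(d) {
  d <- d[order(d$user_id, d$timestamp), ]
  d$last_event <- lag_by_user(d$event, d$user_id)
  d[d$event == "business", c("user_id", "last_event")]
}

# Without the filter, the row before business is the login event
without_filter <- previous_event(events)

# With the filter, only business and progress rows remain, so the
# row before business is the progress event
with_filter <- previous_event(events[events$event %in% c("business", "progress"), ])

without_filter
with_filter
```

On the question's data both versions agree, because only business and progress events are present.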