How to find the last log entry before each event? (R)

Here is my table:

user_id    event       timestamp
Rob        business    111111
Rob        progress    111112
Rob        business    222222
Mike       progress    111111
Mike       progress    222222
Rob        progress    000001
Mike       business    333333
Mike       progress    444444
Lee        progress    111111
Lee        progress    222222
Mike       business    333334
dput of the table:

    dput(input)
    df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 1L, 2L),
 .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
 event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L),
 .Label = c("business", "progress"), class = "factor"), 
timestamp = c(111111,111112, 222222, 111111, 222222, 1, 333333, 444444, 111111, 222222, 333334)), 
.Names = c("user_id", "event", "timestamp"), row.names = c(NA, -11L), class = "data.frame")
Thanks for your help.

df <-
structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 
2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business", 
"progress"), class = "factor"), timestamp = c(111111,111112, 222222, 
111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id", 
"event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")

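# Goal: the last progress event before every business event.
# Note: this compares adjacent rows in their current order and ignores user_id.

new <- df[0, ]  # empty data frame with the same columns as df
for (i in 2:nrow(df)) {
  # keep row i-1 when it is a progress row immediately followed by a business row
  if (df$event[i] == "business" && df$event[i - 1] == "progress") {
    new <- rbind(new, df[i - 1, ])
  }
}
new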

Note that there are only two rows in the result, because `business` appears only three times, and its first occurrence is in the first row.
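The row-by-row comparison in the loop can also be written without a loop. The following is only a sketch in base R, not part of the original answer: it rebuilds the 10-row example data with `data.frame()` and uses the illustrative names `is_progress_here`, `is_business_next`, and `prev_rows`. Like the loop, it looks at adjacent rows in their current order and does not take `user_id` into account.

```r
# Same 10 rows as the answer's df, written out with data.frame()
df <- data.frame(
  user_id   = c("Rob", "Rob", "Rob", "Mike", "Mike",
                "Rob", "Mike", "Mike", "Lee", "Lee"),
  event     = c("business", "progress", "business", "progress", "progress",
                "progress", "business", "progress", "progress", "progress"),
  timestamp = c(111111, 111112, 222222, 111111, 222222,
                1, 333333, 444444, 111111, 222222),
  stringsAsFactors = FALSE
)

n <- nrow(df)
is_progress_here <- df$event[-n] == "progress"  # rows 1 .. n-1
is_business_next <- df$event[-1] == "business"  # rows 2 .. n, shifted up by one

# Indices of progress rows that are immediately followed by a business row
prev_rows <- which(is_progress_here & is_business_next)
new <- df[prev_rows, ]
new
```

On this data `prev_rows` picks out rows 2 and 6, the same two rows the loop returns.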

If I've understood the question correctly, this looks solvable with the `lag` function from `dplyr`.

Here is an example:

# Set up the data structure
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 
    2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
    event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business", 
    "progress"), class = "factor"), timestamp = c(111111,111112, 222222, 
    111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id", 
    "event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")

# Perform the manipulation (dplyr provides %>%, arrange, group_by, mutate, lag, ...)
library(dplyr)

df %>% 
    arrange(user_id, timestamp) %>% # Sort by user and timestamp
    group_by(user_id) %>% # Group/partition by each user
    mutate(last_event = lag(event, 1), # Find the last event
           last_timestamp = lag(timestamp, 1)) %>% # And the time it occurred
    filter(event == "business") %>% # Chop down to just the business events - as that's what we're interested in
    select(user_id, last_event, last_timestamp) %>% # Select the fields of interest
    rename(event = last_event, # Tidy up the field names
           timestamp = last_timestamp)

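On this dataset the output is the same with or without first filtering down to the `business` and `progress` events, but that filter may become a necessary step if other event types appear in the data.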

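How is the second row included in the expected output?

@RonakShah Sorry, that was a mistake.

@Hack-R Yes, sorry.

@Smasell No worries, my answer has the correct result. Let me know if you have any questions.

@Hack-R I mean per user_id. Mike has a business event at timestamp 333333 and a progress event at timestamp 222222. But where is Mike?

@Smasell Mike shouldn't appear in the result. Mike never had progress before business. Check your example carefully, and if you are still confused, give me a row number.

Mike has `progress`, but never a `progress` right before `business`.

Sorry! Use the timestamp to get the order within each user_id.

@Smasell No problem. I'm not sure I understand your comment. Do you see why this result is correct now? It captures the last `progress` event row before every `business` event row, as you asked.

@Hack-R The OP updated the question again. I'm not sure what conditions he wants for the result. Anyway, I've deleted my answer for now.

Thx, it works, but I've run into a problem: if I have two consecutive business events, I get business events in my result! How can I fix it?

You can use similar logic with `lead` and/or `lag` together with the `row_number()` function to identify `business` events that occur one after another. Then you can drop one or more of the consecutive `business` events. Alternatively, you can extend the `lag` logic to look for the latest `progress` event before any `business` event (even consecutive ones). It depends on whether you are interested in the `progress` event immediately before a `business` event, or just the latest `progress` event before each `business` event.

I'm interested in the latest progress event before each business event.

If you mock up some sample data, I can come up with a simple join-based approach.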
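One way the "latest `progress` before each `business`, even when `business` events are consecutive" variant from the comments might be sketched is to carry the most recent `progress` timestamp forward within each user. This is an illustration only, not the commenter's method: the `events` data frame below is made-up sample data, the column names `last_progress`, `business_timestamp`, and `last_progress_timestamp` are hypothetical, and it assumes the `dplyr` and `tidyr` packages are installed.

```r
library(dplyr)
library(tidyr)

# Made-up sample data (an assumption for illustration):
# Mike has two consecutive business events; Rob's business has no earlier progress.
events <- data.frame(
  user_id   = c("Mike", "Mike", "Mike", "Mike", "Rob", "Rob"),
  event     = c("progress", "business", "business", "progress", "business", "progress"),
  timestamp = c(100, 200, 300, 400, 50, 60),
  stringsAsFactors = FALSE
)

result <- events %>%
  arrange(user_id, timestamp) %>%                # order events within each user
  group_by(user_id) %>%
  # progress rows keep their own timestamp, every other row starts as NA
  mutate(last_progress = if_else(event == "progress", timestamp, NA_real_)) %>%
  # carry the latest progress timestamp down onto the rows that follow it
  fill(last_progress, .direction = "down") %>%
  ungroup() %>%
  # keep business rows that have at least one earlier progress event
  filter(event == "business", !is.na(last_progress)) %>%
  select(user_id,
         business_timestamp = timestamp,
         last_progress_timestamp = last_progress)

result
```

Here both of Mike's business events (200 and 300) are paired with his progress event at 100, and Rob's business event is dropped because no progress event precedes it. Whether such rows should be dropped or kept with `NA` depends on what the result is used for.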
Output of the `for` loop:

  user_id    event timestamp
2     Rob progress    111112
6     Rob progress         1

Output of the dplyr pipeline:

  user_id    event timestamp
   <fctr>   <fctr>     <dbl>
1    Mike progress    222222
2     Rob progress         1
3     Rob progress    111112

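The same pipeline with an initial filter that keeps only the `business` and `progress` events:

# Assumes library(dplyr) has been loaded
df %>% 
    filter(event == "business" | event == "progress") %>% # Drop any other event types first
    arrange(user_id, timestamp) %>% 
    group_by(user_id) %>% 
    mutate(last_event = lag(event, 1),
           last_timestamp = lag(timestamp, 1)) %>% 
    filter(event == "business") %>% 
    select(user_id, last_event, last_timestamp) %>% 
    rename(event = last_event, 
           timestamp = last_timestamp)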