Rearranging a data frame in R (tags: r, data-manipulation, traminer)


I have a data frame that looks like this:

created_at  actor_attributes_email      type
3/11/12 7:28    jeremy@asynk.ch         PushEvent
3/11/12 7:28    jeremy@asynk.ch         PushEvent
3/11/12 7:28    jeremy@asynk.ch         PushEvent
3/11/12 7:42    jeremy@asynk.ch         IssueCommentEvent
3/11/12 11:06   d.bussink@gmail.com     PushEvent
3/11/12 11:06   d.bussink@gmail.com     PushEvent

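Now I want to rearrange it by month/year (still sorted by time, and still keeping each row intact). This should create 3 columns for each month and put all the data belonging to that month (created_at, actor_attributes_email and type) into those 3 columns, so that I get the following headers (for every month present in the data):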

How can I achieve this in R?

A CSV file containing the entire dataset can be found at:

Here is the dput() of the first rows of the CSV:

structure(list(created_at = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 
8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L), .Label = c("2012-03-11 07:28:04", 
"2012-03-11 07:28:19", "2012-03-11 07:42:16", "2012-03-11 11:06:13", 
"2012-03-11 12:46:25", "2012-03-11 13:03:12", "2012-03-11 13:12:34", 
"2012-03-11 13:14:52", "2012-03-11 13:30:14", "2012-03-11 13:30:48"
), class = "factor"), actor_attributes_email = structure(c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
"d.bussink@gmail.com", "jeremy@asynk.ch"), class = "factor"), 
    type = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L), .Label = c("IssueCommentEvent", "PushEvent"
    ), class = "factor")), .Names = c("created_at", "actor_attributes_email", 
"type"), class = "data.frame", row.names = c(NA, -30L))

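Some other assumptions:

Even if "PushEvent" (for example) is repeated 10 times, I need to keep every occurrence, because I will be doing sequence analysis with the R package TraMineR
The columns can be of unequal length
There is no relationship between the columns of different months
The data within a month should be sorted with the earliest time first
Data from, say, June 2011 and June 2012 need to be in separate columns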

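Maiasaura offers an elegant way to do this with plyr and lubridate. Below is a slightly less elegant way to do it in base R. Unlike Maiasaura's approach, however, this one minimizes the number of NA rows: the number of NA rows for each month is the difference between that month's row count and the largest row count of any month.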
# split df by month
by.mon <- split(df, months(as.POSIXct(df$created_at)))

# rename the columns to include the month name
by.mon <- mapply(
    function(x, mon.name) {
        names(x) <- paste(mon.name, names(x), sep='_');
        return(x)
    }, x=by.mon, mon.name=names(by.mon), SIMPLIFY=FALSE)

# add an index column for merging on
by.mon.indexed <- lapply(by.mon, function(x) within(x, index <- 1:nrow(x)))

# merge all of the months together
results <- Reduce(function(x, y) merge(x, y, by='index', all=TRUE, sort=FALSE), 
    by.mon.indexed)

# remove the index column
final_result <- results[names(results) != 'index']
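
The NA padding described above can be checked on a tiny example. This is only a sketch with made-up data: the data frames jan and feb and their columns are hypothetical stand-ins for two indexed months of unequal length.

```r
# Two made-up "months" of unequal length (3 rows and 1 row),
# each already carrying the index column used for merging.
jan <- data.frame(index = 1:3,
                  jan_type = c("PushEvent", "PushEvent", "IssueCommentEvent"))
feb <- data.frame(index = 1L, feb_type = "PushEvent")

# An outer join on the index pads the shorter month with NA.
merged <- merge(jan, feb, by = "index", all = TRUE)

# Number of NA rows for the shorter month.
na.rows <- sum(is.na(merged$feb_type))
```

Here na.rows equals nrow(jan) - nrow(feb), that is, the longest month's row count minus the shorter month's row count.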

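Comments:

Impressed that you and mplourde were able to understand the problem statement. Thank you. However, the months end up stacked in an inconvenient way, i.e. if April ends at row 1543, then May starts at row 1544. Is there a way to make sure each month's data starts at row 2 (i.e. right after the header)? Also, how do I get the months into chronological order?

Having each month as its own set of columns does make the result very wide. If you skip the last line, the result list will simply contain one element per month, which can easily be written out as data.frames or worked on further without merging everything into one huge, unwieldy data.frame (which I assume you need for some reason).

Is it possible to make this sensitive to the year as well? Right now it puts the activity from June 2011 and June 2012 in the same column. Thanks!

Is there a way to make sure the months are in order?

You can sort by.mon using: by.mon ...

Is it possible to make this sensitive to the year as well? Right now it puts the events from June 2011 and June 2012 in the same column.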
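Maiasaura's approach, using plyr and lubridate:

library(plyr)
library(lubridate)
# parse the timestamps, then derive an abbreviated month label to group by
df$created_at <- ymd_hms(df$created_at, quiet = TRUE)
df$mname <- as.character(lubridate::month(df$created_at, label = TRUE, abbr = TRUE))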
result <- dlply(df, .(mname), function(x){
      x <- arrange(x, created_at)
      names(x) <- paste0(unique(x$mname), "_", names(x))
      x$mname <- NULL
      x
    }, .progress = 'text')

final_result <- ldply(result, rbind.fill)[, -1]
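
The comments ask for columns that are year-sensitive and in chronological order. One possible way to get both, sketched here in base R, is to split on a year-month key instead of the month name. This is not taken from either answer: the toy data, the "Y%Y_M%m" key format, the UTC time zone, and the names by.ym, by.ym.indexed and wide are all assumptions made for illustration.

```r
# Toy data with the same columns as the question; the values are made up.
df <- data.frame(
  created_at = c("2012-06-02 09:00:00", "2011-06-15 08:30:00",
                 "2011-06-01 12:00:00", "2012-03-11 07:28:04"),
  actor_attributes_email = c("a@example.com", "b@example.com",
                             "a@example.com", "c@example.com"),
  type = c("PushEvent", "IssueCommentEvent", "PushEvent", "PushEvent"),
  stringsAsFactors = FALSE
)

# Parse the timestamps (UTC is an assumption) and sort the rows by time.
df$created_at <- as.POSIXct(df$created_at, tz = "UTC")
df <- df[order(df$created_at), ]

# A key such as "Y2011_M06" keeps June 2011 apart from June 2012.
# split() orders its groups by sorted key, which is chronological here.
by.ym <- split(df, format(df$created_at, "Y%Y_M%m"))

# Prefix the column names with the key and add a row index to merge on.
by.ym.indexed <- Map(function(x, key) {
  names(x) <- paste(key, names(x), sep = "_")
  x$index <- seq_len(nrow(x))
  x
}, by.ym, names(by.ym))

# Outer-join all groups on the index, so every group starts in the first row.
wide <- Reduce(function(x, y) merge(x, y, by = "index", all = TRUE),
               by.ym.indexed)

# Restore index order, then drop the helper column.
wide <- wide[order(wide$index), names(wide) != "index", drop = FALSE]
rownames(wide) <- NULL
```

With this toy input, wide has three columns per year-month group, ordered from the earliest group to the latest, and groups shorter than the longest one are padded with NA at the bottom.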
