Concatenating cells in a data frame using aggregate.data.frame
I have a data frame that looks like this:
df <- data.frame(bee.num=c(1,1,1,2,2,3,3), plant=c("d","d","w","d","d","w","d"))
df$visits = list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4,5:11)
df
bee.num plant visits
1 1 d 1, 2, 3
2 1 d 4, 5, 6, 7, 8, 9
3 1 w 10, 11
4 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
5 2 d 11, 12
6 3 w 1, 2, 3, 4
7 3 d 5, 6, 7, 8, 9, 10, 11
I tried:
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
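but I always get an "arguments imply differing number of rows" error. Any help would be much appreciated. Thanks in advance.
In answer to your specific question, I don't think aggregate.data.frame makes this easy. As I said in an earlier post, most R users would probably work out a way to do this with plyr.
However, because my first exposure to data analysis was through database scripting, I still favor the sqldf package for this kind of task.
I also find SQL more transparent to non-R users (something I run into often in the social science community where I do most of my work).
Here is how to solve the problem using sqldf: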
#your data assigned to dat
bee.num <- c(1,1,1,2,2,3,3)
plant <- c("d", "d", "w", "d", "d", "w", "d")
visits <- c("1, 2, 3"
,"4, 5, 6, 7, 8, 9"
,"10, 11"
,"1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
,"11, 12"
,"1, 2, 3, 4"
,"5, 6, 7, 8, 9, 10, 11")
#name the column bee_num directly: older versions of sqldf converted "."
#into "_" in field names, and naming it here avoids relying on that
dat <- data.frame(bee_num = bee.num, plant = plant, visits = visits)
#load sqldf
library(sqldf)
#write a simple SQL aggregate query using group_concat()
#i.e. "select" your fields specifying the aggregate function for the
#relevant field, "from" a table called dat, and "group by" bee_num and plant.
sqldf('select
bee_num
,plant
,group_concat(visits) visits
from dat
group by
bee_num
,plant')
bee_num plant visits
1 1 d 1, 2, 3,4, 5, 6, 7, 8, 9
2 1 w 10, 11
3 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12
4 3 d 5, 6, 7, 8, 9, 10, 11
5 3 w 1, 2, 3, 4
If you pass the data frame containing the list as a column, rather than the list itself, the function works as expected:
x <- aggregate.data.frame(df['visits'], list(df$bee.num, df$plant) , FUN=c)
names(x) <- c('bee.num', 'plant', 'visits')
x
## bee.num plant visits
## 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
## 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
## 3 3 d 5, 6, 7, 8, 9, 10, 11
## 4 1 w 10, 11
## 5 3 w 1, 2, 3, 4
So simply call aggregate as shown above.
Also note that the error comes from trying to coerce the list to a data frame. The first two lines of aggregate.data.frame are:
if (!is.data.frame(x))
x <- as.data.frame(x)
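Only a "rectangular" list can be coerced to a data.frame: all entries must have the same length.
If you first unlist the list column to make a long data frame, you can also get the desired output: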
visits <- unlist(df$visits, use.names=FALSE)
df <- df[rep(rownames(df), sapply(df$visits, length)), c("bee.num", "plant")]
df$visits <- visits
aggregate.data.frame(df$visits, by=list(bee.num = df$bee.num, plant = df$plant), FUN=c)
# bee.num plant x
# 1 1 d 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 2 d 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
# 3 3 d 5, 6, 7, 8, 9, 10, 11
# 4 1 w 10, 11
# 5 3 w 1, 2, 3, 4
## Or, better yet:
aggregate(visits ~ bee.num + plant, df, c)
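Comments:
I suggest you provide us with a reproducible example.
Thanks, I just added some code that reproduces the first data frame. I hope that meets your suggestion.
Thank you for the helpful explanation. Everything seems to work perfectly on the larger data set I am using.
+1. Interesting find. It is a bit annoying, though, that we can't use aggregate.formula and have to "melt" the data.frame before using it further.
+1 for the explanation.
@AnandaMahto Yes, that is the advantage of your approach. I do prefer aggregate.formula for readability, but it refuses to handle list columns. In fact, that is exactly why I wanted to +1 your post.
Your input data doesn't seem to match the OP's input data.
Fair point. I built my data myself from an earlier version of the question, in which it was unclear whether the visits column held integers or strings. If visits is stored as a list of integer vectors (rather than comma-separated values), then sqldf won't work quite so simply. In any case, aggregate.data.frame seems more flexible than I thought.
As an added advantage (although not stated in the answer), you can use aggregate.formula with this "long" data frame.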
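To make the remark about sqldf and list columns concrete, here is a minimal sketch, assuming the question's df with visits stored as a list of integer vectors. It collapses each list element into a comma-separated string, which is the shape the sqldf answer starts from. The names dat and bee_num follow that answer; using toString() for the conversion is an assumption of this sketch, not something the thread prescribes.

```r
# Rebuild the question's data frame with a list column
df <- data.frame(bee.num = c(1, 1, 1, 2, 2, 3, 3),
                 plant   = c("d", "d", "w", "d", "d", "w", "d"))
df$visits <- list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11)

# toString() pastes the elements of one vector together with ", ";
# vapply() applies it to every element of the list column
dat <- data.frame(bee_num = df$bee.num,
                  plant   = df$plant,
                  visits  = vapply(df$visits, toString, character(1)),
                  stringsAsFactors = FALSE)

dat$visits[1]
```

After this step dat holds plain character strings, so a string aggregate in SQL has something it can concatenate.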
> class(df$visits)
[1] "list"
> class(df['visits'])
[1] "data.frame"
as.data.frame(df$visits)
## Error in data.frame(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11, check.names = TRUE, :
## arguments imply differing number of rows: 3, 6, 2, 10, 4, 7
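The difference between the two calls can be reproduced on a smaller scale. This is a rough sketch using a hypothetical three-element list, ragged, in place of df$visits. It shows that coercing a bare list of unequal-length vectors fails, while the same list stored as a column of a data frame is already a data.frame, so the is.data.frame() check passes.

```r
# Hypothetical ragged list standing in for df$visits
ragged <- list(1:3, 4:9, 10:11)

# Coercing the bare list treats each element as a column,
# so the unequal lengths trigger an error; capture its message
res <- tryCatch(as.data.frame(ragged),
                error = function(e) conditionMessage(e))
print(res)

# Stored as a list column, each element occupies one row instead
wrapped <- data.frame(id = seq_along(ragged))
wrapped$visits <- ragged

# Single-bracket indexing returns a data frame, not the bare list
is.data.frame(wrapped["visits"])
nrow(wrapped["visits"])
```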
library(data.table)
DT <- data.table(df)
setkey(DT, bee.num, plant)
DT[, list(visits = list(unlist(visits))), by = key(DT)]
# bee.num plant visits
# 1: 1 d 1,2,3,4,5,6,
# 2: 1 w 10,11
# 3: 2 d 1,2,3,4,5,6,
# 4: 3 d 5,6,7,8,9,10,
# 5: 3 w 1,2,3,4
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
# $ bee.num: num 1 1 2 3 3
# $ plant : Factor w/ 2 levels "d","w": 1 2 1 1 2
# $ visits :List of 5
# ..$ : int 1 2 3 4 5 6 7 8 9
# ..$ : int 10 11
# ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
# ..$ : int 5 6 7 8 9 10 11
# ..$ : int 1 2 3 4
# - attr(*, "sorted")= chr "bee.num" "plant"
# - attr(*, ".internal.selfref")=<externalptr>
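For comparison, the group-then-unlist idea behind the data.table call can be sketched in base R with split() and lapply(). This is only an illustration under the assumption that df is the original data frame from the question, with visits as a list column. The name combined is hypothetical, and the result is a named list rather than a data frame.

```r
# The question's data frame, with visits as a list column
df <- data.frame(bee.num = c(1, 1, 1, 2, 2, 3, 3),
                 plant   = c("d", "d", "w", "d", "d", "w", "d"))
df$visits <- list(1:3, 4:9, 10:11, 1:10, 11:12, 1:4, 5:11)

# split() partitions the list column by each bee.num/plant combination;
# drop = TRUE discards combinations that never occur (here 2/w)
by_group <- split(df$visits, list(df$bee.num, df$plant), drop = TRUE)

# unlist() flattens each group's sub-list into one integer vector
combined <- lapply(by_group, unlist)

# Groups are named "<bee.num>.<plant>"
combined[["1.d"]]
```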