R: Add a column to a data frame showing the frequency of a variable

In R, it's always the little things that trip me up.

Suppose I have a data frame like this:

  location   species
1  seattle   A
2  buffalo   C
3  seattle   D
4  newark    J
5  boston    Q

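I would like to append a column to this data frame showing how many times each location appears in the data set, giving a result like this: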

  location   species    freq-loc
1  seattle   A          2           #there are 2 entries with location=seattle
2  buffalo   C          1           #there is 1 entry with location=buffalo
3  seattle   D          2
4  newark    J          1
5  boston    Q          1

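I know that

table(data$location)

gives me a contingency table, but I don't know how to map each value in the table to the corresponding entry in the data frame. Can anyone help?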

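Update

Thanks a lot for your help! Out of interest, I ran a benchmark to see how the merge, plyr, and ave solutions perform. The test set is a 10,000-row subset of my original 10×7mil data set: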

Unit: milliseconds
expr        min         lq     median        uq       max neval
MERGE 110.877337 111.989406 112.585420 113.51679 120.23588   100
PLYR  26.305645  27.080403  27.576580  27.87157  68.40763   100
AVE   2.994528   3.117255   3.179898   3.35834  10.02955   100

I'm sure someone will shortly post an (ugly ;) ave or plyr solution, but here's the data.table one:
library(data.table)
dt = data.table(your_df)

dt[, `freq-loc` := .N, by = location]
# note: using `-quotes around your var name, because of the "-" in the name
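
The backquotes matter because R would otherwise parse freq-loc as a subtraction. A small base R illustration follows; the data frame d is a hypothetical copy of the example data, not something from the original post, and the final rename to an underscore is just one way to sidestep the quoting.

```r
# Hypothetical copy of the example data; check.names = FALSE keeps the dash.
d <- data.frame(location = c("seattle", "buffalo", "seattle", "newark", "boston"),
                species  = c("A", "C", "D", "J", "Q"),
                `freq-loc` = c(2L, 1L, 2L, 1L, 1L),
                check.names = FALSE)

# A non-syntactic name must be backquoted (or passed as a string to [[ ]]).
dash_col <- d$`freq-loc`
same_col <- d[["freq-loc"]]

# With the default check.names = TRUE, data.frame() applies make.names(),
# which replaces the dash with a dot.
fixed_name <- make.names("freq-loc")

# Renaming to an underscore avoids the quoting altogether.
names(d)[names(d) == "freq-loc"] <- "freq_loc"
```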

Merge:

Also, I heard a request for plyr:

library(plyr)
join(data, data.frame(table(location = data$location)))
# Joining by: location
# location species Freq
# 1  seattle       A    2
# 2  buffalo       C    1
# 3  seattle       D    2
# 4   newark       J    1
# 5   boston       Q    1
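
The merge code itself is not shown above. As a rough sketch only, and not necessarily what was benchmarked, a base R merge() against the same frequency table could look like this. The id helper column is an assumption added here, because merge() sorts by the key and would otherwise lose the original row order.

```r
data <- data.frame(location = c("seattle", "buffalo", "seattle", "newark", "boston"),
                   species  = c("A", "C", "D", "J", "Q"),
                   stringsAsFactors = FALSE)

# One row per location, with its count in a column named Freq.
counts <- as.data.frame(table(location = data$location), stringsAsFactors = FALSE)

# Hypothetical helper column so the original row order can be restored afterwards.
data$id <- seq_len(nrow(data))

# merge() matches on location and sorts the result by that key.
merged <- merge(data, counts, by = "location")

# Restore the original order and drop the helper column.
merged <- merged[order(merged$id), c("location", "species", "Freq")]
rownames(merged) <- NULL
```

If row order does not matter, the id column and the reordering step can be dropped.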

Here's a base R way with ave:

transform(d, freq.loc = ave(seq(nrow(d)), location, FUN=length))
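
As a sketch of why this works: ave() splits its first argument by the grouping variable, applies FUN to each group, and writes the result back to every position in that group, so FUN = length yields the group size on each row. The data frame d below is a reconstruction of the example, and the lookup into table() is an added alternative that speaks to the original question about mapping table values back to rows.

```r
d <- data.frame(location = c("seattle", "buffalo", "seattle", "newark", "boston"),
                species  = c("A", "C", "D", "J", "Q"),
                stringsAsFactors = FALSE)

# Group size for every row; seq_len() only supplies one value per row to count.
d$freq_loc <- ave(seq_len(nrow(d)), d$location, FUN = length)

# Equivalent lookup: index the contingency table by each row's location name.
tab <- table(d$location)
lookup <- as.vector(tab[as.character(d$location)])

print(d)
```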

Trying to use dashes in column names will be very painful. Better to use underscores or "dots":

dfrm$freq_loc

Nah, the "correct" plyr solution, I think, is:

ddply(df, .(location), mutate, freq.loc = length(location))

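By "correct" I mean "conceptually correct", at least as far as the plyr framework goes, rather than faster. I can't say I care about the speed of either approach, since I'm firmly in the "data.table does this better" camp, but if you're interested in that, you should run the comparison and post the results. Using a benchmarking package like microbenchmark would probably be best.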
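
A benchmark along those lines might be set up as in the sketch below. The simulated data, the function names f_merge and f_ave, and the row count are placeholders rather than the original test set, and the timing step only runs if the microbenchmark package is installed.

```r
set.seed(42)
n <- 10000

# Simulated stand-in for the real data (assumption: four locations, random species).
d <- data.frame(location = sample(c("seattle", "buffalo", "newark", "boston"), n, replace = TRUE),
                species  = sample(LETTERS, n, replace = TRUE),
                stringsAsFactors = FALSE)

# Candidate 1: per-row group size via ave().
f_ave <- function(d) ave(seq_len(nrow(d)), d$location, FUN = length)

# Candidate 2: merge the data with its frequency table.
f_merge <- function(d) {
  counts <- as.data.frame(table(location = d$location), stringsAsFactors = FALSE)
  merge(d, counts, by = "location")
}

# Sanity check before timing: both approaches must agree on the count per location.
ave_counts   <- tapply(f_ave(d), d$location, unique)
m            <- f_merge(d)
merge_counts <- tapply(m$Freq, m$location, unique)
same_counts  <- all(ave_counts == merge_counts[names(ave_counts)])

# Time both candidates only when the optional package is available.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(MERGE = f_merge(d), AVE = f_ave(d), times = 100))
}
```

Checking that the candidates agree before timing them keeps the comparison honest; timings will vary with machine and data.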