Identify values that occur a certain number of times in an R data frame


I have a data frame of strings, most of which are duplicates. I want to identify the values in this data frame that occur at least x times.

   df <- data.frame(x = c("str", "str", "str", "ing", "ing","."))
   occurs <- 3

Maybe table() is what you need. Here is an example adapted from your code:
> df <- data.frame(x = c("str", "str", "str", "ing", "ing","."))
> df
    x
1 str
2 str
3 str
4 ing
5 ing
6   .
> table(df$x)

  . ing str 
  1   2   3 
> table(df$x) > 2

    .   ing   str 
FALSE FALSE  TRUE 
> names(which(table(df$x) > 2))
[1] "str"
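
The steps above can be wrapped into a small helper that uses the occurs variable from the question. This is only a sketch: the function name values_at_least is hypothetical, and it uses >= so that "at least x times" is matched literally.

```r
# Hypothetical helper (not from the original answer): return the values of `x`
# that occur at least `occurs` times, based on table().
values_at_least <- function(x, occurs) {
  counts <- table(x)                           # frequency of each distinct value
  names(counts)[as.vector(counts >= occurs)]   # keep the names that meet the threshold
}

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

values_at_least(df$x, occurs)
```

Note that table() ignores NA by default, so missing values are never reported by this sketch.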

You can also use count() from dplyr:
library(dplyr)
df %>% count(x)

Internally this calls n(), and the result is:
# Source: local data frame [3 x 2]
#
#     x n
# 1   . 1
# 2 ing 2
# 3 str 3

If you only want the values that occur at least 3 times, add filter():
df %>% count(x) %>% filter(n >= 3)

which gives:
# Source: local data frame [1 x 2]
# 
#     x n
# 1 str 3

Finally, if you only want to extract the factor values that match the filter condition:
df %>% count(x) %>% filter(n >= 3) %>% .$x

# [1] str
# Levels: . ing str
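
As a possible variation on the pipeline above, the sketch below uses pull() (available in dplyr 0.7.0 and later) in place of .$x and returns a plain character vector. It also shows one way to keep the matching rows of the original data frame instead of only the values. The names frequent and df_frequent are made up for illustration.

```r
library(dplyr)

df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3  # threshold from the question

# Values that occur at least `occurs` times, as a plain character vector
frequent <- df %>%
  count(x) %>%
  filter(n >= occurs) %>%
  pull(x) %>%
  as.character()

# Rows of the original data frame whose value is frequent
df_frequent <- df %>%
  group_by(x) %>%
  filter(n() >= occurs) %>%
  ungroup()
```

Here frequent holds the qualifying values and df_frequent keeps every original row that carries one of them.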


As @David suggested in the comments, you can also use data.table:
library(data.table)
setDT(df)[, if(.N >= 3) x, by = x]$V1

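Or, as @Frank suggested, you can use tabulate(), the "workhorse" behind table(). Note that this relies on the column being a factor, which was the data.frame() default for strings before R 4.0.0: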
levels(df[[1]])[tabulate(df[[1]])>=3]

# [1] "str" 
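
If the column might be a character vector, which is the default since R 4.0.0, a minimal sketch of the same idea converts it to a factor first. Passing nbins explicitly is an assumption of this sketch, made so that the counts line up one-to-one with the levels.

```r
df <- data.frame(x = c("str", "str", "str", "ing", "ing", "."))
occurs <- 3

f <- factor(df$x)  # works whether the column is character or factor
counts <- tabulate(as.integer(f), nbins = nlevels(f))  # one count per level

levels(f)[counts >= occurs]
```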


Benchmark
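# stringsAsFactors = TRUE keeps x a factor on R >= 4.0.0, which base2 needs
df <- data.frame(x = sample(LETTERS[1:26], 10e6, replace = TRUE), stringsAsFactors = TRUE)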
df2 <- copy(df)

library(microbenchmark)
mbm <- microbenchmark(
  base = names(which(table(df$x) >= 385000)),
  base2 = levels(df[[1]])[tabulate(df[[1]]) >= 385000L],
  dplyr = count(df, x) %>% filter(n >= 385000) %>% .$x,
  DT1 = setDT(df2)[, if(.N >= 385000) x, by = x]$V1,
  DT2 = setDT(df2)[, .N, by = x][, x[N >= 385000]],
  times = 50
)
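
Before comparing timings, it can be worth confirming that the candidate expressions return the same values. The sketch below does this for the two base R approaches on a smaller sample. The seed, the sample size and the threshold are arbitrary choices for illustration and are not taken from the benchmark above.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- factor(sample(LETTERS, 1e5, replace = TRUE))
threshold <- 3800L  # arbitrary cut-off below the average count per letter

# Same logic as the `base` and `base2` entries, applied to a plain factor
via_table    <- names(which(table(x) >= threshold))
via_tabulate <- levels(x)[tabulate(as.integer(x), nbins = nlevels(x)) >= threshold]

identical(via_table, via_tabulate)
```

Both expressions walk the levels in the same order, so identical() is a sufficient check here.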


I wonder how library(data.table); setDT(df)[, if (.N >= occurs) x, by = x]$V1 will perform. Or maybe setDT(df)[, .N, by = x][, x[N >= occurs]] (not sure which is better).
Should be very fast. Let me add it to the benchmark.
When adding it, don't run it on the same data set. Create df2.
I don't think I would care about speed for this operation, but base wins on my computer: base2 = levels(df[[1]])[tabulate(df[[1]]) > 385000L]
@Frank Yes, tabulate(), the "workhorse" of table(), is indeed much faster. I updated the benchmark accordingly.