R: How to set subsequent repeated values to blank when a value is repeated more than 3 times

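I have a data frame as shown below. In it, the value 45 is repeated more than 3 times, and the same is true of the value 67. Wherever a value is repeated (frozen) more than 3 times, the repeats after the first one need to be set to blank/NA in a new column, New_Value: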

Name    Value   New_Value
 A       24      24
 A       45      45
 A       45      
 A       45      
 A       45      
 A       45      
 A       93      93 
 A       19      19
 A       10      10
 B       29      29
 B       67      67
 B       67         
 B       67      
 B       67      
 C      201     201
 C      993     993
 C      396     396
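Here is a data.table approach. I show two solutions: (1) looking for duplicates within each Names group, and (2) looking for duplicates across all the data. The code is as follows: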

library(data.table)
# sample() is called directly, so no pipe package has to be loaded
dt <- data.table(Names = sample(LETTERS[1:5], 100, replace = TRUE),
                 Value = sample(1:10, 100, replace = TRUE))
dt <- dt[order(Names, Value)]

# if you look for in-group duplicates
dt[, count := .N, by = .(Names, Value)][, New_Value := Value]
dt[ , dup_ingroup := duplicated(Value), by = Names]
dt[dup_ingroup & count > 3, New_Value := NA]

# if you look for all duplicates
dt[, count := .N, by = Value][, New_Value := Value]
dt[duplicated(Value) & count > 3, New_Value := NA]
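The difference between the two variants is easier to see on a tiny deterministic table than on random data. The sketch below is an illustration only: the table `toy` and the column names `count_in`, `dup_in`, `New_in`, `count_all` and `New_all` are made up for this example and are not part of the answer.

```r
library(data.table)

# Hypothetical toy data: 7 occurs four times overall, but only twice per name.
toy <- data.table(Names = c("A", "A", "B", "B", "B"),
                  Value = c(7L, 7L, 7L, 7L, 9L))

# Variant 1: count and flag duplicates within each Names group
toy[, count_in := .N, by = .(Names, Value)]
toy[, dup_in := duplicated(Value), by = Names]
toy[, New_in := Value][dup_in & count_in > 3, New_in := NA]

# Variant 2: count and flag duplicates across the whole table
toy[, count_all := .N, by = Value]
toy[, New_all := Value][duplicated(Value) & count_all > 3, New_all := NA]

toy[, .(Names, Value, New_in, New_all)]
```

In this toy table nothing is blanked within groups, because no name holds the value 7 more than twice, while the table-wide variant blanks every 7 after the first.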

Following the comments below, here is a comparison of the methods:

library(data.table)
library(dplyr)
set.seed(20170515)
dt <- data.table(Names = LETTERS[1:5] %>% sample(100, replace = TRUE),
                 Value = sample(1:10, 100, replace = TRUE))
dt <- dt[order(Names, Value)]
dt_1 <- copy(dt)
dt_2 <- copy(dt) 
dt_Jaap <- copy(dt)
# Method 1
dt_1[, count := .N, by = .(Names, Value)][, New_Value := Value]
dt_1[ , dup_ingroup := duplicated(Value), by = Names]
dt_1[dup_ingroup & count > 3, New_Value := NA]
dt_1[, .N, by = is.na(New_Value)] 
## is.na  N
## 1: FALSE 73
## 2:  TRUE 27

# Method 2
dt_2[, count := .N, by = Value][, New_Value := Value]
dt_2[duplicated(Value) & count > 3, New_Value := NA]
dt_2[, .N, by = is.na(New_Value)] 
## is.na  N
## 1: FALSE 12
## 2:  TRUE 88

# Method suggested by @Jaap
dt_Jaap[, New_Value := Value][duplicated(Value) & .N > 3, New_Value := NA_integer_, by = .(Names, Value)]
dt_Jaap[, .N, by = is.na(New_Value)]  
## is.na  N
## 1: FALSE 10
## 2:  TRUE 90
dt_Jaap keeps the value only for the first occurrence of each value.
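A likely reason for the 10/90 split is that data.table evaluates the i expression on the whole table before `by` is applied, so `duplicated(Value)` is table-wide and `.N` in i is the total row count. The sketch below reproduces that effect on a hypothetical four-row table; `toy`, `one_liner` and `grouped` are names invented for the illustration, and the grouped variant is only one possible way to test the group size.

```r
library(data.table)

# Hypothetical toy data: no (Names, Value) group has more than 2 rows.
toy <- data.table(Names = c("A", "A", "B", "B"),
                  Value = c(1L, 1L, 1L, 2L))

# One-liner: i is evaluated on the whole table, so duplicated() is global
# and .N is the total number of rows (4), not the group size.
one_liner <- copy(toy)[, New_Value := Value]
one_liner[duplicated(Value) & .N > 3, New_Value := NA_integer_,
          by = .(Names, Value)]

# Grouped variant: test .N inside j, where it is the size of the group,
# and blank everything except the first row of a large group.
grouped <- copy(toy)
grouped[, New_Value := if (.N > 3) replace(Value, -1L, NA) else Value,
        by = .(Names, Value)]

one_liner$New_Value  # rows 2 and 3 are NA although no group exceeds 3 rows
grouped$New_Value    # unchanged, because every group is small
```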

And a dplyr/tidyverse way, assuming the order of the rows in the data frame does not matter:

df <- data.frame(Name = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","C","C","C","C","C"),
                 Value = c(24,45,45,45,45,45,93,19,10,29,67,67,67,67,201,993,396,396,396),
                 stringsAsFactors = FALSE)

library(dplyr)

df %>% 
  group_by(Name, Value) %>% 
  mutate(New_Value = ifelse(n() > 3 & row_number() > 1, NA, Value))
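The assumption that row order does not matter has a visible consequence: occurrences are counted per (Name, Value) pair even when they are not adjacent. The following base R sketch restates the same grouping logic with `ave()` on hypothetical data so the effect can be checked without dplyr; the helper names `n_in_group` and `pos_in_group` are made up for the example.

```r
# Hypothetical toy data: 45 occurs four times in group A, but never
# more than twice in a row.
df <- data.frame(Name  = rep("A", 5),
                 Value = c(45, 45, 82, 45, 45))

# Size of, and position within, each (Name, Value) group, ignoring row order
n_in_group   <- ave(df$Value, df$Name, df$Value, FUN = length)
pos_in_group <- ave(df$Value, df$Name, df$Value, FUN = seq_along)

# Same rule as the dplyr code: blank all but the first of a large group
df$New_Value <- ifelse(n_in_group > 3 & pos_in_group > 1, NA, df$Value)
df
```

Here the second, third and fourth 45 are blanked even though the 82 interrupts the sequence, which is what motivates a run-based approach when consecutive repeats are meant.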
Update

A more robust approach that handles multiple runs of the same value:

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name    Value
A       45
A       45
A       45
A       82
A       45
A       45
A       45
A       45
A       12
A       45
A       45
A       45
A       45
A       45
B       29
B       67
B       67
B       67
B       67
")

library(dplyr)

df %>%
  group_by(Name) %>%
  # length of the run each row belongs to
  mutate(run_length = with(rle(Value), rep(lengths, lengths))) %>%
  # TRUE for the first row of each run
  mutate(run_start = seq_along(Value) %in% cumsum(c(1, rle(Value)$lengths))) %>%
  # keep short runs and run starts, blank everything else
  mutate(New_Value = ifelse(run_length < 4 | run_start, Value, NA)) %>%
  ungroup() %>% select(-run_length, -run_start)
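To see what the two helper columns contain, the run-length logic can be traced on a single vector in base R. This is a minimal sketch on hypothetical values; `x`, `runs` and `new_x` are names chosen for the illustration.

```r
# Hypothetical values: a run of three 45s, then 82, then a run of four 45s
x <- c(45, 45, 45, 82, 45, 45, 45, 45, 12)

runs <- rle(x)

# Length of the run that each element belongs to
run_length <- rep(runs$lengths, runs$lengths)

# TRUE at the first element of every run
run_start <- seq_along(x) %in% cumsum(c(1, head(runs$lengths, -1)))

# Keep elements of short runs and the first element of long runs
new_x <- ifelse(run_length < 4 | run_start, x, NA)
new_x
```

Only the run of four 45s is affected: its first element is kept and the remaining three become NA, while the earlier run of three is left untouched.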
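FWIW, here is another data.table solution, which uses rleid() instead of duplicated().

Note that the OP asked to set the subsequent repeated values to blank if a value is repeated more than 3 times. This means that for values repeated only twice, no blanks should appear in the result. I have modified my sample dataset to include a case where the same value is repeated twice.

Edit: The OP has not made clear whether he counts repetitions of the same value in a given sequence regardless of Name, or within the sequence of each Name group.

In addition, the OP has not specified what result he expects if there is a run of repeated values during which the Name changes.

Therefore, I have modified the sample dataset to include these additional use cases: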

DT
#    Name Value
# 1:    A    24
# 2:    A    24
# 3:    A    45
# 4:    A    45
# 5:    A    45
# 6:    A    45
# 7:    A    45
# 8:    A    93
# 9:    A    19
#10:    A    19
#11:    A    10
#12:    B    29
#13:    B    67
#14:    B    67
#15:    B    67
#16:    B    67
#17:    C   201
#18:    C   993
#19:    C   396
#20:    A    19
#21:    A    19
#22:    C    19
#23:    B    29
#24:    B    67
#25:    B    67
#26:    B    67
#27:    B    67
#28:    C    67
#29:    C    67
#30:    C    67
#31:    C    67
#    Name Value
As in the other answers, NA stands for blank.

library(data.table)
setDT(DT)[, New := Value[.N < 3], by=rleid(Value)][rowid(rleid(Value)) == 1L, New := Value]
DT
#    Name Value New
# 1:    A    24  24
# 2:    A    24  24
# 3:    A    45  45
# 4:    A    45  NA
# 5:    A    45  NA
# 6:    A    45  NA
# 7:    A    45  NA
# 8:    A    93  93
# 9:    A    19  19
#10:    A    19  19
#11:    A    10  10
#12:    B    29  29
#13:    B    67  67
#14:    B    67  NA
#15:    B    67  NA
#16:    B    67  NA
#17:    C   201 201
#18:    C   993 993
#19:    C   396 396
#20:    A    19  19
#21:    A    19  NA
#22:    C    19  NA
#23:    B    29  29
#24:    B    67  67
#25:    B    67  NA
#26:    B    67  NA
#27:    B    67  NA
#28:    C    67  NA
#29:    C    67  NA
#30:    C    67  NA
#31:    C    67  NA
#    Name Value New
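The one-liner relies on two data.table helpers: `rleid()` numbers consecutive runs, and `rowid()` numbers the rows within each run. The sketch below spells these out as explicit columns on a hypothetical table. It is a step-by-step rewrite for illustration, not the answer's code: the names `toy`, `run_id`, `pos_in_run` and `run_len` are invented, and an explicit `ifelse()` replaces the zero-length subsetting idiom.

```r
library(data.table)

# Hypothetical toy data: a run of two 24s, a run of three 45s, a single 93
toy <- data.table(Value = c(24L, 24L, 45L, 45L, 45L, 93L))

toy[, run_id := rleid(Value)]         # id of each run of equal values
toy[, pos_in_run := rowid(run_id)]    # position of the row within its run
toy[, run_len := .N, by = run_id]     # length of the run

# Keep rows of runs shorter than 3 and the first row of longer runs
toy[, New := ifelse(run_len < 3L | pos_in_run == 1L, Value, NA_integer_)]
toy
```

As in the answer, runs of fewer than 3 values are kept whole, so the two 24s survive while the second and third 45 are blanked.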
If a change of Name is also expected to restart the run, this variant can be used (credit to Jaap):

setDT(DT)[, New := Value[.N < 3], by = rleid(Name, Value)
          ][is.na(New) & rowid(rleid(Name, Value)) == 1L, New := Value][]
#    Name Value New
# 1:    A    24  24
# 2:    A    24  24
# 3:    A    45  45
# 4:    A    45  NA
# 5:    A    45  NA
# ...
#18:    C   993 993
#19:    C   396 396
#20:    A    19  19
#21:    A    19  19
#22:    C    19  19
#23:    B    29  29
#24:    B    67  67
#25:    B    67  NA
#26:    B    67  NA
#27:    B    67  NA
#28:    C    67  67
#29:    C    67  NA
#30:    C    67  NA
#31:    C    67  NA
#    Name Value New
Note the differences in rows 21, 22 and 28.
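The differences come from how the runs are delimited. A small sketch on hypothetical data shows that `rleid(Value)` lets a run continue across a change of Name, whereas `rleid(Name, Value)` starts a new run; the column names `run_by_value` and `run_by_name_value` are made up for the example.

```r
library(data.table)

# Hypothetical toy data: the value 19 continues while Name changes from A to C
toy <- data.table(Name  = c("A", "A", "C", "C"),
                  Value = c(19L, 19L, 19L, 19L))

toy[, run_by_value := rleid(Value)]             # one run of length 4
toy[, run_by_name_value := rleid(Name, Value)]  # two runs of length 2
toy
```

With the first grouping the four rows form one long run that would be blanked after its first row; with the second they form two short runs that are kept as they are.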

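Data
Note that rows 1 and 8 of the OP's dataset have been duplicated to cover the case of two repetitions, and a few rows have been added at the end.

DT <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "A", "A", 
"C", "B", "B", "B", "B", "B", "C", "C", "C", "C"), Value = c(24L, 
24L, 45L, 45L, 45L, 45L, 45L, 93L, 19L, 19L, 10L, 29L, 67L, 67L, 
67L, 67L, 201L, 993L, 396L, 19L, 19L, 19L, 29L, 67L, 67L, 67L, 
67L, 67L, 67L, 67L, 67L)), .Names = c("Name", "Value"), row.names = c(NA, 
-31L), class = "data.frame")

Comments

This has already been answered here: possible duplicate

@Larusson, this is not a duplicate question. The poster needs to blank out data values that appear more than 3 times. All the answers there deal with ordinary duplicates or with data.table; none of them addresses consecutive repeats. It is also not clear from the request whether the asker is aware of this distinction.

What do you expect if a value is repeated exactly twice? Unfortunately, your sample data does not cover this case.

You can shorten this. For the in-group duplicates: dt[, New_Value := Value][duplicated(Value) & .N > 3, New_Value := NA_integer_, by = .(Names, Value)]. As you can see, you don't have to create a count variable; you can use .N directly in i.

@Jaap, this does not seem to work, because the duplicated(Value) & .N > 3 part is not grouped by the by variables. The result of dt[, .N, by = is.na(New_Value)] differs between my code and yours.

It gives me the correct output. Have you updated to the latest version of data.table, v1.10.4?

Sorry for the misunderstanding, you are correct. I thought .N could now be used in i, but I can't find it in the NEWS file; apparently that is not the case :-/

Fixed a typo in the code... I used data instead of df as the data frame name, but conceptually it is exactly the same.

Please note that your solution will also put NA for the second value if the same value is repeated twice. The OP asked for blanks only if a value is repeated three or more times. Unfortunately, the OP's sample dataset does not include a case of two repetitions.

I somehow missed the > 3 on first reading. I have updated the code, assuming the row order in the data frame is not important.

I think := v[logi] is a weird thing to do. When logi is false, the RHS has length zero, and I would expect that to break things... If it doesn't, it seems to me that this relies on a strange implementation quirk. Anyway, I guess I'd rather see x := if (logi) v else v[NA_integer_].

@Frank Since version 1.9.8 this is no longer weird: a new column is guaranteed with := even when there are no matches or its RHS has length 0.

Thanks for the pointer. Well, weirdness is subjective: I prefer if (logi) .SD over .SD[logi] for the same reason. Generally, when an error occurs, both of these weird/quirky syntax issues go away.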