R: keep only the cluster with the most elements
Starting with the sample data:
> dput(data)
structure(list(Country = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("France", "Spain"), class = "factor"),
Car = structure(c(6L, 17L, 7L, 18L, 4L, 13L, 20L, 5L, 14L,
21L, 8L, 11L, 15L, 9L, 12L, 16L, 8L, 11L, 15L, 9L, 12L, 19L,
3L, 10L, 1L, 2L), .Label = c("Audi_1_EON", "Audi_2_EON",
"Ferrari_1_EOD", "Fiat_1_EOD", "Fiat_1_EON", "Mazda_1_EOD",
"Mazda_1_EON", "Mercedes_1_EOD", "Mercedes_1_EON", "Mercedes_2_EOD",
"Nexia_1_EOD", "Nexia_1_EON", "Opel_1_EOD", "Opel_1_EON",
"Peugeot_1_EOD", "Peugeot_1_EON", "Porsche_2_EOD", "Porsche_2_EON",
"Tico_1_EON", "VW_1_EOD", "VW_1_EON"), class = "factor"),
ValueOfComp = c(13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L,
14L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 16L,
16L, 12L, 12L, 12L, 12L)), .Names = c("Country", "Car", "ValueOfComp"
), class = "data.frame", row.names = c(NA, -26L))
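In the data provided, the first column contains two different countries. The next column lists the cars assigned to each country, and the last column (ValueOfComp) holds the cluster number.
I want to keep only one cluster per country in the table, and it must be the largest cluster for that country. Take France as an example. Two clusters (13 and 14) are assigned to it, and cluster 14 clearly contains more elements/cars. In this case I want to keep cluster 14 and remove cluster 13 from the data.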
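To make the goal concrete, the cluster sizes per country can be tabulated in base R. This is only a minimal sketch: df is a hypothetical stand-in that reproduces the Country and ValueOfComp columns of the sample data, and sizes and largest are names made up for the illustration.

```r
# Stand-in for the sample data: only the Country and ValueOfComp columns,
# with the same values as in the dput() output (Car is omitted for brevity)
df <- data.frame(
  Country = rep(c("France", "Spain"), times = c(10, 16)),
  ValueOfComp = rep(c(13L, 14L, 15L, 16L, 12L), times = c(4, 6, 6, 6, 4))
)

# Rows are countries, columns are cluster ids, cells are cluster sizes
sizes <- table(df$Country, df$ValueOfComp)
print(sizes)

# Largest cluster id per country; which.max() takes the first one on ties
largest <- apply(sizes, 1, function(n) colnames(sizes)[which.max(n)])
print(largest)
```

The France row of the table shows 4 cars in cluster 13 and 6 in cluster 14, which is why cluster 14 is the one to keep.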
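The data provided is just an example. My real data is a huge table, so I expect that in some cases several clusters will contain the same number of elements. When that happens, it does not matter which of them is kept in the data.
One option, using data.table: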
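library(data.table)
setDT(data)  # convert to a data.table by reference before subsetting
# rle() assumes each cluster's rows are contiguous within a country; the %in%
# filter ignores Country, so cluster ids must not be shared across countries
data[ValueOfComp %in% data[, rle(ValueOfComp), Country][
  , values[which.max(lengths)], Country]$V1, ]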
Country Car ValueOfComp
1: France Fiat_1_EOD 14
2: France Opel_1_EOD 14
3: France VW_1_EOD 14
4: France Fiat_1_EON 14
5: France Opel_1_EON 14
6: France VW_1_EON 14
7: Spain Mercedes_1_EOD 15
8: Spain Nexia_1_EOD 15
9: Spain Peugeot_1_EOD 15
10: Spain Mercedes_1_EON 15
11: Spain Nexia_1_EON 15
12: Spain Peugeot_1_EON 15
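Using dplyr:
library(dplyr)
data %>%
  group_by(Country, ValueOfComp) %>%
  mutate(size = n()) %>%                     # size of each cluster within its country
  group_by(Country) %>%
  filter(size == max(size)) %>%              # keep the largest cluster(s) per country
  filter(ValueOfComp == max(ValueOfComp))    # on ties, keep the highest cluster id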
Source: local data frame [12 x 4]
Groups: Country [2]
Country Car ValueOfComp size
(fctr) (fctr) (int) (int)
1 France Fiat_1_EOD 14 6
2 France Opel_1_EOD 14 6
3 France VW_1_EOD 14 6
4 France Fiat_1_EON 14 6
5 France Opel_1_EON 14 6
6 France VW_1_EON 14 6
7 Spain Mercedes_1_EOD 16 6
8 Spain Nexia_1_EOD 16 6
9 Spain Peugeot_1_EOD 16 6
10 Spain Mercedes_1_EON 16 6
11 Spain Nexia_1_EON 16 6
12 Spain Tico_1_EON 16 6
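We can also use the plyr package with subset. count() tabulates the cluster sizes within each country, and which.max() returns the position of the first maximum:
library(plyr)
ddply(data, "Country", subset, ValueOfComp == count(ValueOfComp)$x[which.max(count(ValueOfComp)$freq)])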
# Country Car ValueOfComp
#1 France Fiat_1_EOD 14
#2 France Opel_1_EOD 14
#3 France VW_1_EOD 14
#4 France Fiat_1_EON 14
#5 France Opel_1_EON 14
#6 France VW_1_EON 14
#7 Spain Mercedes_1_EOD 15
#8 Spain Nexia_1_EOD 15
#9 Spain Peugeot_1_EOD 15
#10 Spain Mercedes_1_EON 15
#11 Spain Nexia_1_EON 15
#12 Spain Peugeot_1_EON 15
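What happens if two clusters have the same size? Keep both, or pick one at random?
I mentioned that in the last paragraph: pick one at random.
That would be too easy for you :). Here is an approach that picks the first group by default.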
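For the random pick discussed in these comments, a base R sketch could look like the following. It is not taken from any of the answers: df is a hypothetical stand-in for the Country and ValueOfComp columns of the sample data, keep_largest is a made-up helper name, and set.seed() is there only to make the draw reproducible.

```r
set.seed(1)  # assumption: fixed seed, only so the random pick is repeatable

# Stand-in for the sample data (same ValueOfComp values, Car omitted)
df <- data.frame(
  Country = rep(c("France", "Spain"), times = c(10, 16)),
  ValueOfComp = rep(c(13L, 14L, 15L, 16L, 12L), times = c(4, 6, 6, 6, 4))
)

# Hypothetical helper: return the rows of one country's largest cluster,
# choosing at random among the clusters tied for the largest size
keep_largest <- function(g) {
  counts <- table(g$ValueOfComp)
  tied <- names(counts)[as.vector(counts) == max(counts)]
  chosen <- tied[sample.int(length(tied), 1)]
  g[as.character(g$ValueOfComp) == chosen, , drop = FALSE]
}

# Apply the helper to each country and stack the pieces back together
result <- do.call(rbind, lapply(split(df, df$Country), keep_largest))
print(table(result$Country, result$ValueOfComp))
```

In the sample data Spain has two clusters of six cars (15 and 16), so that is the country where the random choice matters. Replacing the sample.int() line with tied[1] would give the "first group" behaviour instead.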