R: Collapsing redundant rows in a data table

I have a data table in the following format:

myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)
print(myTable)

   Col1 Col2
1:    A    1
2:    A    2
3:    A    3
4:    B    4
5:    B    5
6:    B    6
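
For each category in Col1, I want to keep only the row with the highest Col2 value and collapse the remaining rows into a single "Others" row that holds their sum. I managed to do this with the following code: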

uniqueVals <- unique(myTable$Col1)                              # unique values in Col1 (renamed so it does not shadow unique())
myTable2 <- data.table()                                        # empty data table to populate
for (each in uniqueVals) {
    temp <- myTable[Col1 == each, ]                             # rows for the current Col1 value
    temp <- temp[order(-Col2)]                                  # sort by Col2 in decreasing order
    sumCol2 <- sum(temp$Col2)                                   # sum of Col2 in this group
    temp <- temp[1, ]                                           # keep only the first (highest) row
    remSum <- sumCol2 - sum(temp$Col2)                          # remaining sum in Col2 (without the first row)
    temp <- rbindlist(list(temp, data.table("Others", remSum))) # append the "Others" row
    myTable2 <- rbindlist(list(myTable2, temp))                 # append this group's rows to the result
}

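Here the data is split into groups by the value of Col1 (by = Col1). .N is the index of the last row in the given group, so c(Col2[.N], sum(Col2) - Col2[.N]) gives the last value of Col2 followed by the sum of Col2 minus that last value. The newly created variables are wrapped in .() because .() is an alias for list() in data.table, and the created columns need to be placed in a list.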
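
As a rough illustration of the explanation above, here is a minimal sketch on the question's myTable. The column name Label and the result name collapsed are assumptions introduced only for this example. Note that it keeps the last row of each group, which is the highest value only because the sample data is already sorted by Col2.

```r
library(data.table)

myTable <- data.table(Col1 = c("A", "A", "A", "B", "B", "B"), Col2 = 1:6)

# Within each Col1 group, Col1 is a length-1 value and .N is the group size.
# Label (hypothetical name) holds the group value, then "Others".
# Col2 holds the last value of the group, then the sum of the other values.
collapsed <- myTable[, .(Label = c(Col1, "Others"),
                         Col2  = c(Col2[.N], sum(Col2) - Col2[.N])),
                     by = Col1]

print(collapsed)
```

Keeping the grouping column Col1 next to Label makes it possible to tell which group each "Others" row belongs to; drop it if only the display columns are needed.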

library(data.table)
setDT(df)  # df is a data.frame holding the same data as myTable in the question

df[, .(Col1 = c(Col1, 'Others'),
       Col2 = c(Col2[.N], sum(Col2) - Col2[.N]))
  , by = Col1][, -1]
#      Col1 Col2
# 1:      A    3
# 2: Others    3
# 3:      B    6
# 4: Others    9

If this is only for display, you can use the tables package:

others <- function(x) sum(x)-last(x)
df %>% tabular(Col1*(last+others) ~ Col2, .)

# Col1        Col2
# A    last   3   
#      others 3   
# B    last   6   
#      others 9

I did it! I made a new table to illustrate. I only want to keep the 4 highest values per category and collapse the others.

set.seed(123)
myTable <- data.table(Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)), Col2 = sample(1:12, 12))
print(myTable)

    Col1 Col2
 1:    A    8
 2:    A    5
 3:    A    2
 4:    B    7
 5:    B   10
 6:    B    9
 7:    B   12
 8:    B   11
 9:    C    4
10:    C    6
11:    C    3
12:    C    1

# set the key to Col2; this sorts the table by Col2 in increasing order
setkey(myTable, Col2)

# pad each Col1 category to at least 4 rows: categories with more than 4 entries keep all of them, smaller ones are filled with NA
myTable <- myTable[, .(Col2 = Col2[1:max(c(4, .N))]), by = Col1]

# Col1: the category repeated 4 times, then "Other"
# Col2: the last (highest) 4 entries of Col2 in that category, then the sum of the remaining entries
myTable <- myTable[, .(Col1 = c(rep(Col1, 4), "Other"), Col2 = c(Col2[.N-3:0], sum(Col2) - sum(Col2[.N-3:0]))), by = Col1]

# drop the rows containing NA that come from the padding step
myTable <- na.omit(myTable)

# drop rows where Col2 is 0; an "Other" row with sum 0 is created when a category has exactly 4 entries
myTable <- myTable[Col2 != 0]
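
The same idea can probably be written without the NA padding and the zero-row cleanup. The following is only a sketch under assumptions: collapse_top_n, its parameter n, and the Label column are hypothetical names made up for this example, and the data is typed in from the printed table above rather than regenerated with sample().

```r
library(data.table)

# Data copied from the printed table above
myTable <- data.table(
  Col1 = c(rep("A", 3), rep("B", 5), rep("C", 4)),
  Col2 = as.integer(c(8, 5, 2, 7, 10, 9, 12, 11, 4, 6, 3, 1))
)

# Hypothetical helper: keep the n largest Col2 values per Col1 group and
# collapse the rest into one "Other" row. Groups with n or fewer rows are
# returned unchanged, without an "Other" row.
collapse_top_n <- function(dt, n = 4L) {
  dt[order(-Col2),                      # sort by Col2, largest first
     if (.N > n) {
       list(Label = c(rep(Col1, n), "Other"),
            Col2  = c(Col2[seq_len(n)], sum(Col2[-seq_len(n)])))
     } else {
       list(Label = rep(Col1, .N), Col2 = Col2)
     },
     by = Col1]
}

res <- collapse_top_n(myTable, n = 4L)
print(res)
```

Because no placeholder rows are created, legitimate Col2 values of 0 are not dropped, and the grouping column Col1 still identifies which category each "Other" row belongs to.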

Here is a base R solution and its dplyr equivalent:

res <- aggregate(Col2 ~.,transform(
  myTable, Col0 = replace(Col1,duplicated(Col1,fromLast = TRUE), "Other")), sum)
res[order(res$Col1),-1]
#    Col0 Col2
# 1     A    3
# 3 Other    3
# 2     B    6
# 4 Other    9

myTable %>%
  group_by(Col0= Col1, Col1= replace(Col1,duplicated(Col1,fromLast = TRUE),"Other")) %>%
  summarize_at("Col2",sum) %>%
  ungroup %>%
  select(-1)
# # A tibble: 4 x 2
#   Col1   Col2
#   <chr> <int>
# 1 A         3
# 2 Other     3
# 3 B         6
# 4 Other     9
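
Both versions above rely on the same relabelling step, which may be easier to follow in isolation. This small base R sketch uses the question's Col1 values; the names is_not_last and labels are assumptions introduced here.

```r
Col1 <- c("A", "A", "A", "B", "B", "B")

# duplicated(..., fromLast = TRUE) scans from the end, so only the last
# occurrence of each value is FALSE; every earlier occurrence is TRUE.
is_not_last <- duplicated(Col1, fromLast = TRUE)

# Relabel every row that is not the last of its group as "Other".
labels <- replace(Col1, is_not_last, "Other")

print(is_not_last)
print(labels)
```

Summing Col2 by the original Col1 together with these labels then yields one row for the last value of each group and one "Other" row for the rest.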

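"I only want to show the first result for each category in Col1." It looks like you want to show the last result.

Actually, I meant the highest value. I have corrected it, thanks.

Okay... that changes everything. I updated my solution.

This is a bad format for analysis, because you can no longer tell which group the first "Others" row belongs to except by the current sort order. Anyway, if you were to "update", maybe
myTable[order(-Col2),lapply(.SD,sum),by=(Col1,r=as.character(replace(r5,“other”))]
though you would need to provide a relevant example so we can confirm... Since so many answers have already been posted, you could post a new question if you cannot figure it out.

Great! Would you mind explaining the syntax? Although I like it a lot, I am still a beginner with the data.table package.

Actually, my process is a bit more complicated: I want to keep the top 5 entries, but some values in Col1 do not have 5 entries; in those cases, all entries should be kept and no "Others" row should be included.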