Relating the indices at which two data frames match to the elements of another array

arrays r dataframe

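I have an array cluster_true and a data frame data in which each row contains a 2D coordinate. In another data frame, I want to store how many times each element of cluster_true occurs at a given 2D coordinate. For example, for the coordinate (1, 1), I want to find all rows of data whose first two columns are both 1, and then look at the values of cluster_true at those indices. Here is an example to make this clearer; it produces the desired result: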

# Example variables
cluster_true = c(1,2,1,1,2,2,1,2,2,2,2,1,1)
x = 3
y = 3
data = data.frame(X = c(1,1,0,0,2,1,1,0,0,0,1,1,1),
                  Y = c(1,1,2,1,2,2,1,0,0,0,0,2,0))

# Names of the columns
plot_colnames = c('X', 'Y', paste('cluster',unique(cluster_true),sep='_'))
# Empty dataframe with the right column names
plot_df = data.frame(matrix(vector(), x*y, length(plot_colnames),
                            dimnames=list(c(), plot_colnames)),
                     stringsAsFactors=F)
# Each row belongs to a certain 2D coordinate
plot_df$X = rep(1:x, y)-1
plot_df$Y = rep(1:y, each = x)-1
# This is what I don't know how to improve
for(i in 1:nrow(plot_df)){
  idx = which(apply(data[,1:2], 1, function(x) all(x == plot_df[i,1:2])))
  plot_df[i,3] = sum(cluster_true[idx] == 1)
  plot_df[i,4] = sum(cluster_true[idx] == 2)
}
print(plot_df)

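Things I need to change but don't know how:

I think the loop could be avoided to get a more elegant solution, but I don't know how. The data frame data may have a large number of rows, so efficient code would be great.

In the loop, I have hard-coded the clusters to check (the last two lines inside the loop), assuming that I know which values occur in cluster_true and which column of plot_df each of them corresponds to. In reality, the elements of cluster_true could be anything, even non-consecutive numbers, e.g. cluster_true = c(1,5,5,56,10,19,10).

So, basically, I would like to know whether this can be done without a loop and as generally as possible.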
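The hard-coded cluster columns described above can be avoided by deriving the columns from the values that actually occur in cluster_true. The following is only a minimal base R sketch of that idea, not the asker's code: the coordinates in data are made up for illustration (only the cluster_true values come from the question), and it assumes integer-valued coordinates so that the pasted text keys compare reliably. It still loops, but only over the distinct cluster values instead of over every grid row.

```r
# Non-consecutive cluster values, as mentioned in the question
cluster_true <- c(1, 5, 5, 56, 10, 19, 10)
# Made-up coordinates, for illustration only
data <- data.frame(X = c(0, 0, 1, 1, 0, 1, 1),
                   Y = c(0, 0, 0, 1, 1, 1, 1))

# One row per grid coordinate (2 x 2 grid assumed for this toy example)
plot_df <- expand.grid(X = 0:1, Y = 0:1)

# For each row of data, the matching row of plot_df (NA if not on the grid)
row_id <- match(paste(data$X, data$Y), paste(plot_df$X, plot_df$Y))

# One column per distinct cluster value, no hard-coded values or positions
for (cl in sort(unique(cluster_true))) {
  # tabulate() counts how often each grid row occurs for this cluster
  plot_df[[paste0("cluster_", cl)]] <-
    tabulate(row_id[cluster_true == cl], nbins = nrow(plot_df))
}

print(plot_df)
```

Because the column names are built from the cluster values themselves, no assumption about which numbers occur, or in which order, is needed.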

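If I understand correctly, the OP wants to (1) find the row indices of all unique combinations of X, Y coordinates in data, (2) look up the values in the corresponding rows of cluster_true, (3) count the occurrences of each value for the given X, Y combination, and (4) print the result in wide format. This can be solved by joining and reshaping: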

library(data.table) # version 1.11.4 used
library(magrittr)   # use piping to improve readability
# unique coordinate pairs
uni_coords <- unique(setDT(data)[, .(X, Y)])[order(X, Y)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI] %>% 
  # reshape from long to wide format, thereby counting occurrences
  dcast(X + Y ~ sprintf("cluster_%02i", V1), length)

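   X Y cluster_01 cluster_02
1: 1 1          2          1
2: 1 2          1          1
3: 1 3          1          1
4: 2 2          0          1
5: 3 1          1          0
6: 3 2          1          0
7: 3 3          0          3

This is the same as OP's expected result, except for the coordinate combinations which do not appear in data:

setDT(plot_df)[order(X, Y)]

   X Y cluster_1 cluster_2
1: 1 1         2         1
2: 1 2         1         1
3: 1 3         1         1
4: 2 1         0         0
5: 2 2         0         1
6: 2 3         0         0
7: 3 1         1         0
8: 3 2         1         0
9: 3 3         0         3

The benefit of reshaping is that it can handle arbitrary values in cluster_true, as requested by the OP.

Edit: The OP has suggested that all possible combinations of X, Y coordinates should be included in the final result. This can be achieved by using the cross join CJ() to compute uni_coords:

# all possible coordinate pairs
uni_coords <- setDT(data)[, CJ(X = X, Y = Y, unique = TRUE)]
# join and lookup values in cluster_true
data[uni_coords, on = .(X, Y), cluster_true[.I], by = .EACHI][
  uni_coords, on = .(X, Y)] %>%
  # reshape from long to wide format, thereby counting occurrences
  dcast(X + Y ~ sprintf("cluster_%02i", V1), length) %>%
  # remove NA column from reshaped result
  .[, cluster_NA := NULL] %>%
  print()

   X Y cluster_01 cluster_02
1: 1 1          2          1
2: 1 2          1          1
3: 1 3          1          1
4: 2 1          0          0
5: 2 2          0          1
6: 2 3          0          0
7: 3 1          1          0
8: 3 2          1          0
9: 3 3          0          3

Aren't the coordinates (1,3), (2,2) and (3,1), which appear in data, missing from plot_df?

Uwe, sorry, I gave the wrong value to the variable y. I have corrected it. Thanks for pointing it out!

Thanks for your reply. If you could add how to obtain the rows that are missing from the final result, i.e. how to get from the first data frame to the second one in the post, this would indeed be the accepted answer.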
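For comparison, the same kind of count table can also be sketched in base R alone, without a loop and without data.table. This is a swapped-in technique (cross-tabulation with table()), not the approach of the answer above. The helper name count_clusters is hypothetical, the 0-based x by y grid is an assumption taken from the question's plot_df code, and coordinates are assumed to be integer-valued so that the pasted text keys compare reliably.

```r
# Example data as posted in the question
cluster_true <- c(1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1)
x <- 3
y <- 3
data <- data.frame(X = c(1, 1, 0, 0, 2, 1, 1, 0, 0, 0, 1, 1, 1),
                   Y = c(1, 1, 2, 1, 2, 2, 1, 0, 0, 0, 0, 2, 0))

# Hypothetical helper: counts of each cluster value per grid coordinate
count_clusters <- function(data, cluster_true, x, y) {
  # All x * y coordinate pairs, X varying fastest (assumed 0-based grid)
  grid <- expand.grid(X = seq_len(x) - 1, Y = seq_len(y) - 1)
  # Text key per row of data; the factor levels cover the whole grid,
  # so coordinates that never occur still get a row of zeros
  key <- factor(paste(data$X, data$Y), levels = paste(grid$X, grid$Y))
  # Cross-tabulate coordinates against whatever cluster values occur
  counts <- table(key, cluster = cluster_true)
  wide <- as.data.frame.matrix(counts)
  names(wide) <- paste0("cluster_", colnames(counts))
  rownames(wide) <- NULL
  cbind(grid, wide)
}

res <- count_clusters(data, cluster_true, x, y)
print(res)
```

Since table() creates one column per distinct value of cluster_true, non-consecutive values such as c(1,5,5,56,10,19,10) need no special handling. Rows of data whose coordinates fall outside the grid get an NA key and are dropped from the counts.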