Symmetric pairs in R

r loops

I have a large data frame that looks like this:

> my_table
   track_fid start_gid end_gid
1          1       100      82
2          2        82     100
3          3       100      82
4          4       100      32
5          5        82     100
6          6        82     100
7          7        82     100
8          8       100      82
9          9        34     100
10        10        31     100

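My goal is to add a column to_from at the end and fill it with the character y or n.

Take the first row as an example, where start_gid = 100 and end_gid = 82. If another row exists anywhere in the table where the values are reversed, i.e. end_gid = 100 and start_gid = 82, I want to fill the to_from column of both rows with y. If no reverse exists, the first row should be filled with n. The key here is to loop through each row and search the table for its reverse in track_fid order. If a reverse is found at a position with a larger track_fid, y should be inserted. Once a reverse row has received the value y, it cannot be used again.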

For example, this would be a sample output:

> output
   track_fid start_gid end_gid to_from
1          1       100      82       y
2          2        82     100       y
3          3       100      82       y
4          4       100      32       n
5          5        82     100       y
6          6        82     100       y
7          7        82     100       n
8          8       100      82       y
9          9        34     100       n
10        10        31     100       n

Is there a way to create such an output in R?

Something along the lines of:

# Rough idea only: this compares adjacent rows, not the whole table
for (i in 2:nrow(my_table)) {
  if (my_table[i - 1, "start_gid"] == my_table[i, "end_gid"]) {
    my_table[i, "to_from"] <- "y" } else { my_table[i, "to_from"] <- "n" }
}


> str(output)
'data.frame':   10 obs. of  4 variables:
 $ track_fid: int  1 2 3 4 5 6 7 8 9 10
 $ start_gid: int  100 82 100 100 82 82 82 100 34 31
 $ end_gid  : int  82 100 82 32 100 100 100 82 100 100
 $ to_from  : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 1 2 1 1

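I can't see a way to do this in R without a loop. You could use a for loop with next and break statements. However, in cases like this I turn to Rcpp when the problem size is large: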

library(Rcpp)
sourceCpp(code = "
          #include <Rcpp.h>
          // [[Rcpp::export]]
          Rcpp::LogicalVector myfun(const Rcpp::IntegerVector x, const Rcpp::IntegerVector y) {
            Rcpp::LogicalVector res(x.length());
            for (int i=0; i<(x.length()-1); i++) {
              if(res(i)) continue;
              for (int j=i+1; j<x.length(); j++) {
                if (res(j)) continue;
                if (x(i) == y(j) && x(j) == y(i)) {
                   res(i) = true;
                   res(j) = true;
                   break;
                }
              }
            }
            return res;
          }
          ")

DF <- my_table  # assumption: DF is the data frame from the question
DF$from_to <- myfun(DF$start_gid, DF$end_gid)
DF
#   track_fid start_gid end_gid from_to
#1          1       100      82    TRUE
#2          2        82     100    TRUE
#3          3       100      82    TRUE
#4          4       100      32   FALSE
#5          5        82     100    TRUE
#6          6        82     100    TRUE
#7          7        82     100   FALSE
#8          8       100      82    TRUE
#9          9        34     100   FALSE
#10        10        31     100   FALSE
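
For readers without a C++ toolchain, the same greedy search can be sketched in plain R with for, next and break, as the answer mentions. This is a minimal sketch rather than the answerer's code: mark_pairs is a hypothetical name, and the rows are assumed to be sorted by track_fid already.

```r
# Sample data from the question
my_table <- data.frame(
  track_fid = 1:10,
  start_gid = c(100L, 82L, 100L, 100L, 82L, 82L, 82L, 100L, 34L, 31L),
  end_gid   = c(82L, 100L, 82L, 32L, 100L, 100L, 100L, 82L, 100L, 100L)
)

# Hypothetical helper: greedy pairing of reversed rows in row order
mark_pairs <- function(start, end) {
  n <- length(start)
  res <- logical(n)                      # all FALSE to begin with
  for (i in seq_len(max(n - 1L, 0L))) {
    if (res[i]) next                     # row i is already paired
    for (j in (i + 1L):n) {
      if (res[j]) next                   # row j is already used
      if (start[i] == end[j] && start[j] == end[i]) {
        res[i] <- TRUE
        res[j] <- TRUE
        break                            # stop at the first free reverse row
      }
    }
  }
  res
}

paired <- mark_pairs(my_table$start_gid, my_table$end_gid)
my_table$to_from <- ifelse(paired, "y", "n")
my_table
```

The nested loop is quadratic in the number of rows in the worst case, which is why the answer suggests compiled code for large inputs.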

Thanks to the detailed description of the algorithm, a different solution without loops can also be built with data.table.
First, we count the unique combinations of start_gid and end_gid:
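library(data.table)
dt <- as.data.table(my_table)  # assumption: dt is the question's data frame as a data.table
pairs <- dt[, .N, by = .(start_gid, end_gid)]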
pairs

#   start_gid end_gid N
#1:       100      82 3
#2:        82     100 4
#3:       100      32 1
#4:        34     100 1
#5:        31     100 1
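
The join output further down has grp and nmatch columns that the pairs table above does not contain yet. The chained benchmark code at the end of the page includes the step that creates them. Unchained, and assuming dt holds the question's data, that step might look like this:

```r
library(data.table)

# Sample data from the question as a data.table
dt <- data.table(
  track_fid = 1:10,
  start_gid = c(100L, 82L, 100L, 100L, 82L, 82L, 82L, 100L, 34L, 31L),
  end_gid   = c(82L, 100L, 82L, 32L, 100L, 100L, 100L, 82L, 100L, 100L)
)

# Count each (start_gid, end_gid) combination
pairs <- dt[, .N, by = .(start_gid, end_gid)]

# grp is an order-independent key: 100 -> 82 and 82 -> 100 both become "82_100".
# A group with both directions can pair min(N) rows per direction;
# a group with only one direction cannot pair any.
pairs <- pairs[
  , .(start_gid, end_gid, nmatch = if (.N <= 1L) 0L else min(N)),
  by = .(grp = paste(pmin(start_gid, end_gid), pmax(start_gid, end_gid), sep = "_"))
]
pairs
```

Here nmatch is the number of rows in each direction that can find a partner, which the last step of the answer uses to flag rows.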

Now we join the two tables. This is a right join, so all rows of dt appear in the output:
out <- pairs[dt, on = .(start_gid, end_gid)]
out
#       grp start_gid end_gid nmatch track_fid
# 1: 82_100       100      82      3         1
# 2: 82_100        82     100      3         2
# 3: 82_100       100      82      3         3
# 4: 32_100       100      32      0         4
# 5: 82_100        82     100      3         5
# 6: 82_100        82     100      3         6
# 7: 82_100        82     100      3         7
# 8: 82_100       100      82      3         8
# 9: 34_100        34     100      0         9
#10: 31_100        31     100      0        10

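Benchmark 1: original data (10 rows)
Using the original data set with only 10 rows, the data.table solution is compared with Roland's Rcpp solution. The benchmark code and timings are listed at the end of this page.
As expected, the Rcpp function on its own is more than 20 times faster than the data.table solution (for the given toy-sized sample data). However, if the call to sourceCpp is included, it takes more than twice as long as the data.table solution.
Note that the data.table code has been condensed by chaining the data.table queries.
Benchmark 2: larger data sets
As suggested by @Roland, I compared the data.table solution with Rcpp on larger data sets: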

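For data frames with fewer than 1000 rows, Rcpp is faster than the data.table solution. For larger data frames, the data.table solution scales much better than the Rcpp solution. Note that this is a property of the implemented algorithm and cannot be attributed to Rcpp in general.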
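Creating sum, prod and then groupby() helps to further improve the provided …

The "correct answer" seems flawed. There is an even number of (82, 100) pairs. Please correct the example or refine the question.
@42- No, the example is fine. Row 7 has no match. To be precise, the output is correct. The order of operations is the point: rows are processed by ascending track_fid.
I believe I have spotted a problem before the last step. I would also sort out … a bit slower than the Rcpp solution.
@Roland I also believe the Rcpp solution is faster. The data.table solution needs 4 calls, 3 groupings and one join. (I guess there are data.table experts around who know better than I do how to do this.) Anyway, finding a different approach was an interesting exercise.
The time-consuming part is probably not the data.table operations but the calls to pmin/pmax and paste, as well as the vector scan (…
Did you benchmark using the OP's example? Because that is too small.
@Roland A benchmark on larger data sets has been added, as you suggested.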
# Final step: within each (start_gid, end_gid) group, flag the first nmatch rows
out <- out[, .(track_fid, to_from = seq_len(.N) <= nmatch), by = .(start_gid, end_gid)]
out[order(track_fid)]
#     start_gid end_gid track_fid to_from
# 1:       100      82         1    TRUE
# 2:        82     100         2    TRUE
# 3:       100      82         3    TRUE
# 4:       100      32         4   FALSE
# 5:        82     100         5    TRUE
# 6:        82     100         6    TRUE
# 7:        82     100         7   FALSE
# 8:       100      82         8    TRUE
# 9:        34     100         9   FALSE
#10:        31     100        10   FALSE
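
The counting idea behind this result does not depend on data.table. As a minimal sketch in base R, the same flags can be derived from per-direction counts. count_pairs is a hypothetical name, not code from either answer. The sketch assumes the rows are sorted by track_fid and that start_gid differs from end_gid in every row. Rows where the two are equal would need separate handling.

```r
# Sample columns from the question, already ordered by track_fid
start_gid <- c(100L, 82L, 100L, 100L, 82L, 82L, 82L, 100L, 34L, 31L)
end_gid   <- c(82L, 100L, 82L, 32L, 100L, 100L, 100L, 82L, 100L, 100L)

# Hypothetical helper: counting approach without loops or extra packages.
# Assumes start != end in every row.
count_pairs <- function(start, end) {
  key_fwd <- paste(start, end, sep = "_")  # direction of this row
  key_rev <- paste(end, start, sep = "_")  # opposite direction
  keys    <- unique(key_fwd)
  counts  <- tabulate(match(key_fwd, keys), nbins = length(keys))
  n_fwd   <- counts[match(key_fwd, keys)]  # rows sharing this direction
  n_rev   <- counts[match(key_rev, keys)]  # NA when the reverse never occurs
  n_rev[is.na(n_rev)] <- 0L
  # Position of each row among the rows with the same direction
  pos <- ave(seq_along(start), key_fwd, FUN = seq_along)
  # Only the first min(n_fwd, n_rev) rows of a direction find a partner
  pos <= pmin(n_fwd, n_rev)
}

to_from <- count_pairs(start_gid, end_gid)
ifelse(to_from, "y", "n")
```

Because each row is handled with a fixed number of vectorised lookups, the work grows far more slowly with the number of rows than the nested loop does.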

library(microbenchmark)
microbenchmark(
  dt = {
    dt[, .N, by = .(start_gid, end_gid)][
      , .(start_gid, end_gid, nmatch = if (.N <= 1L) 0L else min(N)), 
      by = .(grp = paste(pmin(start_gid, end_gid), pmax(start_gid, end_gid), sep = "_"))][
        dt, on = .(start_gid, end_gid)][
          , .(track_fid, to_from = seq_len(.N) <= nmatch), 
          by = .(start_gid, end_gid)][
            order(track_fid)]
  },
  rcpp_source = {
    sourceCpp(code = "
          #include <Rcpp.h>
          // [[Rcpp::export]]
          Rcpp::LogicalVector myfun(const Rcpp::IntegerVector x, const Rcpp::IntegerVector y) {
            Rcpp::LogicalVector res(x.length());
            for (int i=0; i<(x.length()-1); i++) {
              if(res(i)) continue;
              for (int j=i+1; j<x.length(); j++) {
                if (res(j)) continue;
                if (x(i) == y(j) && x(j) == y(i)) {
                   res(i) = true;
                   res(j) = true;
                   break;
                }
              }
            }
            return res;
          }
          ")
    dt$from_to <- myfun(dt$start_gid, dt$end_gid)
    dt
  },
  rcpp_func = {
    dt$from_to <- myfun(dt$start_gid, dt$end_gid)
    dt
  }
)

Unit: microseconds
        expr      min       lq      mean    median       uq       max neval
          dt 2873.017 3233.418 3466.5484 3408.0495 3558.705  6345.633   100
 rcpp_source 8112.335 8537.114 8932.8953 8811.2385 9173.150 12093.931   100
   rcpp_func  101.192  121.582  142.0769  137.4405  154.620   255.246   100