R "join?" distinct values in rows and then match


Apologies for the terrible title; I wasn't sure how to describe my problem.

My dataset looks like this:

----------------------------------
| media_id | filename | duration |
----------------------------------
|  782363  | 000041f1 |   12577  |
----------------------------------
|  782379  | 000041f1 |   12570  |
----------------------------------
|  1449109 | 00006c9b |  530423  |
----------------------------------
|  1449160 | 00006c9b |  530420  |
----------------------------------

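What I want to do is match on the unique filenames (at most two rows will match per filename), so that both durations for a filename end up on the same row.

The goal is to calculate the absolute difference between duration and duration2. For context, the original filenames had different file extensions, but I have truncated them, since that is how I need to match the durations. I'm trying to see whether the duration of fileA differs from that of fileB after it has been converted from one format to another.

I'm comfortable with dplyr, but the best algorithm I could come up with is: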

1-Identify the unique filenames
2-Search through the filename column using grep to locate the rows where the filenames are located
3-Somehow transform, or create a new data frame, that matches the filenames.

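Any thoughts/suggestions? The dataset will have roughly 1 million rows, so ideally I need something that performs reasonably well.

You will also have to reshape the dataset: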

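library(dplyr)
library(tidyr)

# tibble() replaces the deprecated data_frame()
tibble(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420) ) %>%
    group_by(filename) %>%
    # number the rows within each filename (1, 2)
    mutate(sub_group = row_number()) %>%
    # long format: one row per filename, sub_group and variable
    gather(variable, value, -filename, -sub_group) %>%
    # build column names such as duration_1 and media_id_2
    unite(new_variable, variable, sub_group) %>%
    # back to wide format: one row per filename
    spread(new_variable, value) %>%
    # absolute difference, as asked in the question
    mutate(duration.difference = abs(duration_1 - duration_2))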
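
For readers on a recent tidyr (1.0.0 or later), the same reshape can probably be written with pivot_wider() instead of gather()/unite()/spread(). This is only a sketch under that assumption; the names `media`, `wide` and `duration_difference` are made up for illustration and are not from the original answer.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question
media <- tibble(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420)
)

wide <- media %>%
  group_by(filename) %>%
  # number the rows within each filename (1, 2)
  mutate(sub_group = row_number()) %>%
  ungroup() %>%
  # one row per filename; columns become media_id_1, media_id_2,
  # duration_1, duration_2
  pivot_wider(
    names_from = sub_group,
    values_from = c(media_id, duration)
  ) %>%
  mutate(duration_difference = abs(duration_1 - duration_2))

print(wide)
```

A filename that appears only once would get NA in duration_2, and therefore NA in the difference, which makes unmatched files easy to spot.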

Another option outside of dplyr is to use dcast from reshape2. It is essentially an unmelt/pivot function.

library(reshape2)
df <- data.frame(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420))

# Identify a file sequence (will be different with larger distributed file).  
# Will work if file is sorted by filename and has exactly two records per filename.
df$file_seq <- paste0('d', rep(1:2, length.out = nrow(df)))

# unmelt
df2 <- dcast(data = df, formula = filename ~ file_seq, value.var = 'duration')

# calculate the difference
df2$diff <- abs(df2$d1 - df2$d2)
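
If the real file is not sorted by filename, the alternating d1/d2 labels above would be assigned to the wrong rows. One possible workaround, sketched here with base R's ave(), is to number the rows within each filename regardless of row order. The shuffled sample data and the names `df_unsorted` and `df2_unsorted` are assumptions for illustration only.

```r
library(reshape2)

# Same sample rows as above, deliberately shuffled (hypothetical ordering)
df_unsorted <- data.frame(
  media_id = c(1449109, 782363, 1449160, 782379),
  filename = c("00006c9b", "000041f1", "00006c9b", "000041f1"),
  duration = c(530423, 12577, 530420, 12570))

# Number the rows within each filename (1, 2, ...) without relying on sort order
df_unsorted$file_seq <- paste0(
  'd',
  ave(seq_along(df_unsorted$filename), df_unsorted$filename, FUN = seq_along))

# unmelt: one row per filename, columns d1 and d2
df2_unsorted <- dcast(data = df_unsorted, formula = filename ~ file_seq,
                      value.var = 'duration')

# calculate the absolute difference
df2_unsorted$diff <- abs(df2_unsorted$d1 - df2_unsorted$d2)
```

Filenames with a single record would get NA in d2 here rather than being silently mislabelled.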

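I tried implementing your solution, but R keeps crashing. My guess is that it is running out of memory. Any suggestions?

Definitely try it on a small subset of the data first to catch any errors. But if you really want to speed things up, you may need to switch to data.table. There are plenty of data.table wizards around here, but I'm not one of them.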
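
As a rough illustration of the data.table suggestion, the difference can be computed per filename in a single grouped pass, without reshaping to wide format at all. This is a minimal sketch and not something posted in the thread; the names `dt`, `result`, `n_files` and `duration_diff` are hypothetical, and it has not been benchmarked on a million rows.

```r
library(data.table)

# Sample rows from the question
dt <- data.table(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420))

# One row per filename: the number of matching rows and, when there are
# exactly two, the absolute difference between their durations
result <- dt[, .(
  n_files = .N,
  duration_diff = if (.N == 2L) abs(duration[1L] - duration[2L]) else NA_real_
), by = filename]

print(result)
```

Because nothing is pivoted, no intermediate long or wide copy of the data is created, which may help with the memory problem described above.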