R: "join" distinct values in rows, then match


Apologies for the terrible title; I didn't know how to describe my problem.

My dataset looks like this:

----------------------------------
| media_id | filename | duration |
----------------------------------
|  782363  | 000041f1 |   12577  |
----------------------------------
|  782379  | 000041f1 |   12570  |
----------------------------------
|  1449109 | 00006c9b |  530423  |
----------------------------------
|  1449160 | 00006c9b |  530420  |
----------------------------------
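What I want to do is match up the unique filenames (there are at most two matching rows per filename), so that each filename ends up on a single row with its two durations side by side.

The goal is to compute the absolute difference between duration and duration2. For context, the original filenames had different file extensions, but I have truncated them because that is how I need to match the durations. I am trying to see whether the duration of fileA differs from that of fileB after it is converted from one format to another.

I am comfortable with dplyr, but the best algorithm I could come up with is: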

1-Identify the unique filenames
2-Search through the filename column using grep to locate the rows where the filenames are located
3-Somehow transform, or create a new data frame, that matches the filenames.

Any ideas or suggestions? The dataset will have about 1 million rows, so ideally I need something that performs reasonably well.

You will also have to reshape the dataset:

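library(dplyr)
library(tidyr)

# tibble() replaces the deprecated data_frame()
tibble(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420) ) %>%
    group_by(filename) %>%
    # number the rows within each filename: 1, 2
    mutate(sub_group = row_number()) %>%
    # long format: one row per (filename, sub_group, variable)
    gather(variable, value, -filename, -sub_group) %>%
    # build column names such as duration_1 and media_id_2
    unite(new_variable, variable, sub_group) %>%
    # back to wide format: one row per filename
    spread(new_variable, value) %>%
    # the question asks for the absolute difference
    mutate(duration.difference = abs(duration_1 - duration_2))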
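
For reference, the same reshape can probably be written more compactly with `pivot_wider()`, which newer versions of tidyr offer in place of `gather()`/`spread()`. This is only a sketch, assuming tidyr 1.0.0 or later is installed; the names `df`, `wide` and `duration_difference` are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420))

wide <- df %>%
  group_by(filename) %>%
  # number the rows within each filename: 1, 2
  mutate(sub_group = row_number()) %>%
  ungroup() %>%
  # one row per filename; columns become media_id_1, media_id_2, duration_1, duration_2
  pivot_wider(names_from = sub_group, values_from = c(media_id, duration)) %>%
  mutate(duration_difference = abs(duration_1 - duration_2))
```

A filename with only one row should end up with `NA` in `duration_2`, and therefore in the difference.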

Another option outside of dplyr is to use dcast from reshape2. It is essentially an unmelt/pivot function.

library(reshape2)
df <- data.frame(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420))

# Identify a file sequence (will be different with larger distributed file).  
# Will work if file is sorted by filename and has exactly two records per filename.
df$file_seq <- paste('d', rep(1:2), sep='')

# unmelt
df2 <- dcast(data = df, formula = filename ~ file_seq, value.var = 'duration')

# calculate the difference
df2$diff <- abs(df2$d1 - df2$d2)
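
As the comments note, that sequence only works when the file is sorted by filename with exactly two records each. One way around this might be to number the rows within each filename using base R's `ave()`. The sketch below uses the same four rows in shuffled order; it is an illustration under that assumption, not something tested on the full data.

```r
library(reshape2)

# Same rows as above, deliberately out of order
df <- data.frame(
  media_id = c(782363, 1449109, 782379, 1449160),
  filename = c("000041f1", "00006c9b", "000041f1", "00006c9b"),
  duration = c(12577, 530423, 12570, 530420))

# Number the records within each filename (1, 2, ...) regardless of row order
df$file_seq <- paste0("d", ave(seq_along(df$filename), df$filename, FUN = seq_along))

# unmelt
df2 <- dcast(data = df, formula = filename ~ file_seq, value.var = "duration")

# calculate the difference
df2$diff <- abs(df2$d1 - df2$d2)
```

A filename with a single record would get `NA` in `d2`, so its `diff` would be `NA` as well.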

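I tried implementing your solution, but R keeps crashing. My guess is that it is running out of memory. Any suggestions?

Definitely try it on a small subset of the data first to catch any errors. But if you really want to speed things up, you may need to move to data.table. There are plenty of data.table wizards around here, but I am not one of them.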
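
To make the data.table suggestion concrete, here is a rough sketch of what it could look like. It skips the reshape and picks the first and second duration within each filename directly, which may be lighter on memory; this has not been benchmarked on the 1-million-row data. The names `dt`, `res`, `d1`, `d2` and `diff` are illustrative.

```r
library(data.table)

dt <- data.table(
  media_id = c(782363, 782379, 1449109, 1449160),
  filename = c("000041f1", "000041f1", "00006c9b", "00006c9b"),
  duration = c(12577, 12570, 530423, 530420))

# One row per filename; d2 is NA when a filename has only one record
res <- dt[, .(d1 = duration[1], d2 = duration[2]), by = filename]

# Absolute difference, added by reference (no copy of the table)
res[, diff := abs(d1 - d2)]
```

This does not require the rows to be sorted by filename, since the grouping is done by `by = filename`.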