R: Select only the dates and relevant rows from a long-format file based on conditions
Tags: r, bash, text-processing

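I have a csv file that I can import into R. It is a "long format" data frame with many columns, i.e. the same ID has multiple entries. I am reproducing an example of the dataset and of the resulting dataset I am trying to obtain, using only the first 5 columns (my real data actually has many more). The original dataset can be reproduced in R as follows: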

df <- data.frame(
  id   = c("id1","id1","id2","id2","id2","id2","id3","id3"),
  date = c("30/10/20010 from Steve.",
           "30/16/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve.",
           "06/05/2004 from Allen.",
           "08/09/2005 from Anna.",
           "08/05/2008 from Allen. 30/10/2010 from Bobby.",
           "14/03/2002 from Steve. 23/07/2003 from Anna.",
           "08/08/2002 from Steve.",
           "08/08/2002 from Anna. 08/08/2002 from Steve."),
  v1   = c(1,NA,1,1,2,NA,1,2),
  v2   = c(2,NA,2,NA,NA,NA,2,NA),
  v3   = c(1,NA,NA,2,NA,1,1,NA),
  v4   = c("Y","N","N","Y","NA","NA","Y","Y"),
  v5   = c(0,0,NA,0,0,NA,0,NA)
)

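I tried splitting the date column in R, and then thought it might be easier to pick out the dates in BASH, but neither attempt got me anywhere:

try <- strsplit(df$date, "from", fixed=TRUE)
grep "[0-9][0-9]\/[0-9][0-9]\/20[0-9][0-9]" file.csv | less -S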
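Basically, I don't know how to approach this problem. I would be very grateful if someone could suggest the right approach, ideally using just BASH or R.
Thank you all!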
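
As a side note on the two attempts above: `strsplit` on "from" leaves the dates mixed with names, and `grep` only selects whole lines. One possible way to pull out just the date tokens in R is `gregexpr` with `regmatches`. This is a minimal sketch, not the asker's or the answerer's method; `extract_dates` is a hypothetical helper name, and it assumes every date is written as dd/mm/yyyy.

```r
# Hypothetical helper: return every dd/mm/yyyy token in one text cell as a Date.
# The \\b word boundaries reject malformed years such as "20010".
extract_dates <- function(x) {
  x <- as.character(x)  # the column may be a factor in older R versions
  hits <- regmatches(x, gregexpr("\\b[0-9]{2}/[0-9]{2}/[0-9]{4}\\b", x))[[1]]
  as.Date(hits, format = "%d/%m/%Y")
}

cell  <- "30/06/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve."
dates <- extract_dates(cell)
print(dates)       # all dates found in the cell
print(max(dates))  # the most recent one
```

Tokens that match the pattern but are not real calendar dates (for example a month of 16) come back as `NA`, so they are easy to spot before any further selection.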

Here is a script that works for the sample data (after correcting two dates: 30/10/20010 --> 30/10/2010 and 30/16/2005 --> 30/06/2005).


Dear yellowcap, thank you very much for your help. The code works; there are only some warnings after this step: "df.dates

Hello again, I just rechecked this and it gives me a lot of errors. The problem appears when the dates are not unique and the same ID has repeated dates, for example if I add a fourth ID like: df

df <- data.frame(
  id   = c("id1","id1","id2","id2","id2","id2","id3","id3"),
  date = c("30/10/2010 from Steve.",
           "30/06/2005 from Anna. 09/08/2008 from Steve. 09/10/2009 from Steve.",
           "06/05/2004 from Allen.",
           "08/09/2005 from Anna.",
           "08/05/2008 from Allen. 30/10/2010 from Bobby.",
           "14/03/2002 from Steve. 23/07/2003 from Anna.",
           "08/08/2002 from Steve.",
           "08/08/2002 from Anna. 08/08/2002 from Steve."),
  v1   = c(1,NA,1,1,2,NA,1,2),
  v2   = c(2,NA,2,NA,NA,NA,2,NA),
  v3   = c(1,NA,NA,2,NA,1,1,NA),
  v4   = c("Y","N","N","Y","NA","NA","Y","Y"),
  v5   = c(0,0,NA,0,0,NA,0,NA)
)

# Convert each string to a vector of dates, then extract the maximum (latest date)
df.dates <- sapply(as.character(df$date), function(x) strsplit(x, "\\."))
df.dates <- lapply(df.dates, function(x) as.Date(x, format='%d/%m/%Y'))
df.dates <- lapply(df.dates, max)

# Add to dataframe (unlist() drops the Date class, so 'latest' holds days since 1970-01-01)
df$latest <- unlist(df.dates)

# Count the number of NA values per row
df$naCount <- apply(df, 1, function(x) sum(is.na(x[3:7])))

# Split data and select either maximum (latest) date or minimal NA count
df.split <- split(df, df$id)

df.select <- lapply(df.split, function(x){

    # Select by given criterion
    if(length(unique(x$latest)) == 1){
        y <- x[which(min(x$naCount) == x$naCount),]
        }
    else{
        y <- x[which(max(x$latest) == x$latest),]
    }

    # Check if selection was successful
    if(nrow(y) != 1) cat('Warning: non-unique choice, returning more than one line\n')

    # Return result
    y
})

# Combine into output dataframe
df.select <- do.call(rbind, df.select)
> df.select

     id                                          date v1 v2 v3 v4 v5 latest naCount
id1 id1                        30/10/2010 from Steve.  1  2  1  Y  0  14912       0
id2 id2 08/05/2008 from Allen. 30/10/2010 from Bobby.  2 NA NA NA  0  14912       2
id3 id3                        08/08/2002 from Steve.  1  2  1  Y  0  11907       0
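
Regarding the follow-up comment about IDs whose rows share the same latest date: one possible way to always end up with exactly one row per ID is to sort instead of branching. The sketch below sorts by ID, newest date first, then fewest NAs, and keeps the first row of each ID. The tie-breaking order is an assumption based on the script above, the small data set (including `id4`) is made up for illustration, and `latest_date` is a hypothetical helper name.

```r
# Illustrative data only; id4 has two rows that tie on the latest date
df <- data.frame(
  id   = c("id1", "id1", "id4", "id4", "id4"),
  date = c("30/10/2010 from Steve.",
           "30/06/2005 from Anna. 09/10/2009 from Steve.",
           "01/02/2003 from Anna.",
           "01/02/2003 from Steve. 01/02/2003 from Anna.",
           "05/05/2001 from Allen."),
  v1   = c(1, NA, NA, 2, 1),
  v2   = c(2, NA, NA, 1, 1),
  stringsAsFactors = FALSE
)

# Latest valid dd/mm/yyyy date in one cell (NA if the cell holds none)
latest_date <- function(x) {
  hits <- regmatches(x, gregexpr("\\b[0-9]{2}/[0-9]{2}/[0-9]{4}\\b", x))[[1]]
  d <- as.Date(hits, format = "%d/%m/%Y")
  d <- d[!is.na(d)]
  if (length(d) == 0) as.Date(NA) else max(d)
}
df$latest <- do.call(c, lapply(df$date, latest_date))  # stays a Date vector

# Number of missing values in the measurement columns of each row
df$naCount <- rowSums(is.na(df[, c("v1", "v2")]))

# Sort: id, newest date first, fewest NAs first; then keep the first row per id
ord       <- order(df$id, -as.numeric(df$latest), df$naCount)
df.sorted <- df[ord, ]
df.select <- df.sorted[!duplicated(df.sorted$id), ]
print(df.select)
```

Rows that still tie on both the date and the NA count are resolved by their original order in the file, which may or may not be what is wanted for the real data.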