How to import and sort a poorly formatted, stacked CSV file in R

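How do I import and sort this data (following the code section) so that R can work with it easily?

  • Are the organ names, the dose unit "Gy", and the volume unit "CC" all considered "factors" in R? What is the proper terminology for the data set names and the data variables?

  • These histogram exports place one data set after another, like so:

Example data file: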

    Bladder,,
    GY, (CC),
    0.0910151,1.34265
    0.203907,1.55719
    [skipping to end of this data set]
    57.6659,0.705927
    57.7787,0.196091
    ,,
    CTV-operator,,
    GY, (CC),
    39.2238,0.00230695
    39.233,0
    [repeating for remainder of data sets; skipping to end of file]
    53.1489,0
    53.2009,0.0161487
    ,,
    [blank line]
    
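The data set labels (e.g. bladder, CTV-operator, rectum) are sometimes lowercase and usually appear in random order within a file. I have dozens of files sorted into two folders, which I want to import and analyze as one large patient sample.

I have started writing this script, but I suspect there is a better way: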

    [file = file.path()]
    DVH = read.csv(file, header = FALSE, sep = ",", fill = TRUE)
    
    DVH[3] <- NULL      # delete last column from data
    loop = 1; notover = TRUE
    factor(DVH[loop,1]) # Store the first element as a factor
    while(notover)
     {loop = loop + 1   # move to next line
      DVH$1<-factor(DVH[loop,1]) # I must change ...
      DVH$2<-factor(DVH[loop,2]) # ... these lines.
    
      if([condition indicating end of file; code to be learned]) {notover = FALSE}
     }
    # store first element as data label
    # store next element as data label
    # store data for (observations given) this factor
    # if line is blank, move to next line, store first element as new factor, and repeat until end of file
    
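Additional information:

I am new to R and have intermediate MATLAB experience. I am switching from MATLAB to R so that my work can be reproduced more easily by others around the world. (R is free; MATLAB is not.)

This data comes from dose-volume histogram exports for cancer treatment research.

(But a computer scientist recommended that I use R instead.)

Thank you for your time.

This should read the file into a well-structured data frame for further processing. It lets you process multiple files and combine the data into a single data frame. There are more efficient and more dynamic ways to obtain the file paths, but this should give you a starting point.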

    # Create function to process a file
    process.file <- function(filepath){
      # Open connection to file
      con = file(filepath, "r")
    
      # Create empty dataframe
      df <- data.frame(Organ = character(),
                               Dosage = numeric(),
                               Dosage.Unit = character(),
                               Volume = numeric(),
                               Volume.Unit = character(),
                               stringsAsFactors = FALSE)
    
      # Begin looping through file
      while ( TRUE )
      {
        # Read current line
        line <- readLines(con, n = 1)
        # If at end of file, break the loop
        if ( length(line) == 0 ) { break }
    
        # If the current line is not equal to ",," and is not a blank line, then process the line
        if(line != ",," && line != ""){
          # If the last two characters of the line are ",,"
          if(substr(line, nchar(line) - 1, nchar(line)) == ",,"){
            # Remove the commas from the line and set the organ type
            organ <- gsub(",,","",line)
          } 
          # If the last character of the line is equal to ","
          else if(substr(line, nchar(line), nchar(line)) == ","){
            # Split the line at the comma
            units <- strsplit(line,",")
    
            # Set the dosage unit and volume unit, trimming the space left after the comma
            dose.unit <- trimws(units[[1]][1])
            vol.unit <- trimws(units[[1]][2])
          }
          # If the line is not a special case
          else{
            # Split the line at the comma
            vals <- strsplit(line,",")
    
            # Set the dosage value and the volume value
            dosage <- vals[[1]][1]
            volume <- vals[[1]][2]
    
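            # Add the values to the dataframe, storing dosage and volume as numbers
            df <- rbind(df, data.frame(Organ = organ,
                                       Dosage = as.numeric(dosage),
                                       Dosage.Unit = dose.unit,
                                       Volume = as.numeric(volume),
                                       Volume.Unit = vol.unit,
                                       stringsAsFactors = FALSE))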
          }
        }
      }
    
      # Set the column names for the dataframe
      colnames(df) <- c("Organ","Dosage","Dosage.Unit","Volume","Volume.Unit")
    
      # Close the connection to a file
      close(con)
    
      # Return the dataframe
      return(df)
    }
    
    
    # Create a vector of the files to process
    filenames <- c("C:/path/to/file/file1.txt",
                   "C:/path/to/file/file2.txt",
                   "C:/path/to/file/file3.txt",
                   "C:/path/to/file/file4.txt")
    
    # Create a dataframe to hold processed data
    df.patient.sample <- data.frame(Organ = character(),
                                    Dosage = numeric(),
                                    Dosage.Unit = character(),
                                    Volume = numeric(),
                                    Volume.Unit = character(),
                                    stringsAsFactors = FALSE)
    
    # Process each file in the vector of filenames
    for(f in filenames){
      df.patient.sample <- rbind(df.patient.sample, process.file(f))
    }
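
On the terminology part of the question: in R terms, the organ name is a categorical variable and is a natural candidate for a factor, while dose and volume are numeric variables; the units are constant metadata rather than data. Because the labels are sometimes lowercase, they probably need to be normalized before becoming a factor. The following is only a minimal sketch of that step; the small data frame is a made-up stand-in for the combined result above, and lowercasing every label is an assumption about what "the same organ" means in these files.

```r
# Hypothetical stand-in for the combined data frame built above
df.patient.sample <- data.frame(
  Organ  = c("Bladder", "bladder", "CTV-operator", "RECTUM"),
  Dosage = c(0.09, 0.20, 39.22, 12.5),
  Volume = c(1.34, 1.56, 0.002, 3.1),
  stringsAsFactors = FALSE
)

# Normalize case and stray whitespace so "Bladder" and "bladder" become one label,
# then store the result as a factor (a categorical variable)
df.patient.sample$Organ <- factor(tolower(trimws(df.patient.sample$Organ)))

# Inspect the distinct organs and how many rows each one has
levels(df.patient.sample$Organ)
table(df.patient.sample$Organ)
```

Once the column is a factor, functions such as split(), table(), and aggregate() group by organ regardless of the order in which the data sets appeared in the file.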
    
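Here is an alternative version that should be much faster than processing the file line by line in a loop. It first reads the entire data file into a single-column data frame and then cleans up the data with vectorized operations.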

    # Load required library
      library(tidyr)
    
    # Create function to process file
      process.file <- function(path){
    
      # Import data into a single column dataframe
        df <- as.data.frame(scan(path, character(), sep = "\n", quiet = TRUE), stringsAsFactors = FALSE)
    
      # Set column name
        colnames(df) <- "col1"
    
      # Copy organ names to new column
        df$organ <- sapply(df[,1], function(x) ifelse(regmatches(x, regexpr(".{2}$", x)) == ",,", gsub('.{2}$', '', x), NA))
    
      # Fill organ name for all rows
        df <- fill(df, organ, .direction = "down")
    
      # Remove the rows that contained the organ
        df <- df[regmatches(df[,1], regexpr(".{2}$", df[,1])) != ",,", ]
    
      # Copy units into a new column
        df$units <- sapply(df[,1], function(x) ifelse(regmatches(x, regexpr(".{1}$", x)) == ",", gsub('.{1}$', '', x), NA))
    
      # Fill units field for all rows
        df <- fill(df, units, .direction = "down")
    
      # Separate units into dose.unit and vol.unit columns
        df <- separate(df, units, c("dose.unit","vol.unit"), ", ")
    
      # Remove the rows that contained the units
        df <- df[regmatches(df[,1], regexpr(".{1}$", df[,1])) != ",", ]
    
      # Separate the remaining data into dosage and volume columns
        df <- separate(df, col1, c("dosage","volume"), ",")
    
      # Set data type of dosage and volume to numeric
        df[,c("dosage","volume")] <- lapply(df[,c("dosage","volume")], as.numeric)
    
      # Reorder columns
        df <- df[, c("organ","dosage","dose.unit","volume","vol.unit")]
    
      # Return the dataframe
      return(df)
    }
    
    # Set path to root folder directory
    source.dir <- "path/to/root/folder"  # Placeholder: replace with the path to your root folder
    
    # Retrieve all files from folder
    # NOTE: To retrieve all files from the folder and all of its subfolders, set: recursive = TRUE
    # NOTE: To only include files with certain words in the name, include: pattern = "your.pattern.here"
    files <- list.files(source.dir, recursive = FALSE, full.names = TRUE)
    
    # Process each file and store dataframes in list
    ldf <- lapply(files, process.file)
    
    # Combine all dataframes to a single dataframe
    final.df <- do.call(rbind, ldf)
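
The same read-everything-then-clean idea can also be sketched in base R alone, without tidyr. This is only an illustration under stated assumptions: parse.dvh.lines is a hypothetical helper name, the file is assumed to start with a label row, label rows are assumed to end in ",,", and units rows in a single ",". The units are deliberately not repeated on every row.

```r
# Parse stacked "label / units / data" blocks from a character vector of lines.
# parse.dvh.lines is a hypothetical helper name, not part of any package.
parse.dvh.lines <- function(lines) {
  # Drop separator rows (",,") and empty lines
  lines <- lines[!(trimws(lines) %in% c("", ",,"))]

  # A label row ends in ",," (e.g. "Bladder,,"); each one starts a new block
  is.label <- grepl(",,$", lines)
  block <- cumsum(is.label)

  # A units row ends in a single "," (e.g. "GY, (CC),")
  is.units <- !is.label & grepl(",$", lines)

  # Everything else is a "dose,volume" data row; parse them all in one call
  is.data <- !is.label & !is.units
  vals <- read.csv(text = lines[is.data], header = FALSE,
                   col.names = c("dosage", "volume"))

  # Attach to each data row the label of the block it belongs to
  labels <- sub(",,$", "", lines[is.label])
  data.frame(organ  = labels[block[is.data]],
             dosage = vals$dosage,
             volume = vals$volume,
             stringsAsFactors = FALSE)
}

# Small example built from the sample rows in the question
example.lines <- c("Bladder,,", "GY, (CC),", "0.0910151,1.34265", "0.203907,1.55719",
                   ",,", "CTV-operator,,", "GY, (CC),", "39.2238,0.00230695",
                   "39.233,0", ",,", "")
dvh <- parse.dvh.lines(example.lines)
```

For a real file, the lines would come from readLines(path), so the helper could be passed to lapply() over the vector returned by list.files() in the same way as process.file above.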
    
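Does the label always come right after a blank line (except on the first line), and do the column names always come right after the label?

@Moody, yes, from the second data set onward each label immediately follows a line containing only ",," (which the R console prints as a blank line), but the end of the file is marked by a final blank line.

I'm currently looking through web pages and trying to solve this.

Thank you for helping me get started and for introducing me to the syntax and ~17 commands. However, 40 seconds to read two files (103 KB, 106 KB) is too slow (MATLAB can read 70 files in about half that time). Which commands or methods could speed this up? I suppose one way would be to rewrite the loop so that, once the units line has been parsed, it stops checking for the final "," and imports all remaining lines until the next ",," line is reached. Is there a command like MathWorks' "textscan" that repeats "until failure"? I'll search. Another change I'd like to make is to avoid storing the subsequent "GY" and "(CC)" repeatedly. Thanks again.

This imported the two files in about a second, and you introduced me to about 30 commands, an additional package, and syntax. I still have to learn how to work with the resulting list, and how to write a script that imports the entire contents of a folder instead of listing the path of every file.

There is actually a very simple way to do this in R using the list.files() function. I have updated my answer to show you how. If you are new to R, you should check out the tidyverse package. It is essentially a collection of packages with some powerful functionality. The website tidyverse.org gives a good breakdown of what is included.

Doesn't the tidyverse require putting all data in a tibble, i.e. duplicating data to fill every tibble row and column? My data is nested: wouldn't it be better to import it as nested data? That is: patient (30 options), treatment plan (2 options), organ (5 options), tissue-slice data (dose and volume data vectors). Wouldn't it be better to import the data in this nested form and then copy the specified parts of each data set into new data frames for computation, rather than forcing everything into one giant table? (Specifically, to compute an average histogram from the individual histograms.)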
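
On the nested-versus-flat question in the last comment, the two layouts are interconvertible in base R: split() turns a long table into a nested list, so the choice does not have to be made at import time. The sketch below is only an illustration; the patient column, the toy numbers, and the shared 5 Gy dose grid are all assumptions, and linear interpolation with approx() is one possible way to put histograms with different dose bins onto a common grid before averaging.

```r
# Hypothetical long-format table: one row per (patient, organ, dose) point
dvh.long <- data.frame(
  patient = rep(c("p01", "p02"), each = 3),
  organ   = "bladder",
  dosage  = c(0, 10, 20,   0, 10, 20),
  volume  = c(100, 60, 0,  100, 40, 0),
  stringsAsFactors = FALSE
)

# Nested view: a list of organs, each holding a list of per-patient data frames
nested <- lapply(split(dvh.long, dvh.long$organ),
                 function(d) split(d[, c("dosage", "volume")], d$patient))

# Mean histogram for one organ: interpolate every patient's curve onto a
# shared dose grid, then average across patients at each grid point
dose.grid <- seq(0, 20, by = 5)
curves <- sapply(nested$bladder,
                 function(d) approx(d$dosage, d$volume, xout = dose.grid)$y)
mean.dvh <- data.frame(dosage = dose.grid, volume = rowMeans(curves))
```

Here curves is a matrix with one column per patient, so rowMeans() gives the average volume at each dose on the grid. Further grouping levels such as treatment plan could be added as extra columns in the long table and extra split() levels in the nested list.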