In R, how do I create a data frame from a text file with split data?

Tags: r, import

In R, I am trying to import a massive text file with the following structure. Here is a sample, saved as example.txt:

Curve Name: 
     Curve A
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          95
     2               50          90
 Curve Color:
     Blue 

Curve Name: 
     Curve B
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          90
     2               50          80
 Curve Color:
     Green 
So far, I can extract the names and colors:

file.text <- readLines("example.txt")

curve.names <- trimws(file.text[which(regexpr('Curve Name:', file.text) > 0) + 1])
curve.colors <- trimws(file.text[which(regexpr('Curve Color:', file.text) > 0) + 1])

Assuming every file has exactly the same format as the one above:

txt <- readLines("example.txt")
curve_name <- rep(trimws(txt[c(2,13)]), each=3)
curve_color <- rep(trimws(txt[c(10,21)]), each=3)
val <- read.table(text=paste(txt[c(6:8, 17:19)], collapse = "\n"))
colnames(val) <- c("index", "var1", "var2")
cbind(curve_name, curve_color, val)
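The hard-coded positions above only hold for exactly two curves of three rows each. As a minimal sketch, assuming the same marker lines as in the question, the positions can be derived with grep instead. The inline `txt` vector is an abbreviated, made-up single record (with the unit written as `[degC]`), and the names `name_at`, `first_at`, `last_at` and `n_rows` are illustrative only.

```r
# Hypothetical, abbreviated single record so the snippet runs without a file
txt <- c("Curve Name: ", "     Curve A", "Curve Values:",
         "     index   Variable 1   Variable 2",
         "                 [degC]          [%]",
         "     0               30          100",
         "     1               40          95",
         " Curve Color:", "     Blue ")

name_at  <- grep("Curve Name:", txt) + 1    # line after the name marker
first_at <- grep("Curve Values:", txt) + 3  # skip marker, header and units
last_at  <- grep("Curve Color:", txt) - 1   # line before the color marker
n_rows   <- last_at - first_at + 1          # number of data rows

# Parse the numeric rows and repeat the curve name once per row
val <- read.table(text = txt[first_at:last_at],
                  col.names = c("index", "var1", "var2"))
curve_name <- rep(trimws(txt[name_at]), times = n_rows)
result <- cbind(curve_name, val)
result
```

This sketch handles one record; for several records the same three grep calls return one position per record, and the row ranges would have to be processed record by record.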

This usually takes a lot of grep. It also helps to find a way to group the entries, such as the cumulative sum of the blank lines:

l <- readLines(textConnection('Curve Name: 
     Curve A
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          95
     2               50          90
 Curve Color:
     Blue 

Curve Name: 
     Curve B
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          90
     2               50          80
 Curve Color:
     Green '))

do.call(rbind, 
        lapply(split(trimws(l), cumsum(l == '')), function(x){
            data.frame(
                curve = x[grep('Curve Name:', x) + 1], 
                read.table(text = paste(x[(grep('index', x) + 2):(grep('Curve Color:', x) - 1)], 
                                        collapse = '\n'), 
                           col.names = c('index', 'variable.1', 'variable.2')))}))
##       curve index variable.1 variable.2
## 0.1 Curve A     0         30        100
## 0.2 Curve A     1         40         95
## 0.3 Curve A     2         50         90
## 1.1 Curve B     0         30        100
## 1.2 Curve B     1         40         90
## 1.3 Curve B     2         50         80
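To see why the cumulative sum of blank lines works as a grouping key, here is a minimal sketch on a made-up character vector (the vector `x` and its contents are assumptions for illustration, not the question's file). Each blank line bumps the running total, so every line after it lands in the next group.

```r
# Hypothetical toy input: two records separated by one blank line
x <- c("Name:", "A", "", "Name:", "B")

# Each blank line increments the running total, giving one id per record
group <- cumsum(x == "")
group

# split() then yields one character vector per record; the blank
# separator stays as the first element of every group after the first
records <- split(x, group)
records
```

Because the separator stays inside the later groups, the answer's code locates lines relative to the markers with grep rather than by fixed position within each group.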

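Read the lines into L, removing any whitespace before Curve Color. (If the actual file has no whitespace before Curve Color, this step may be unnecessary, but the sample in the question does have it.) Then re-read the lines that start with a digit to create the variables data.frame. Finally, read the rest with read.dcf and use cbind to put the two together.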

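We assume that:

  • Curve Values comes second in each record, so we can drop it with
    [,-2]
  • only the rows of the numeric tables start with a digit (after leading whitespace)
  • each numeric record has 3 columns, with the column names shown in the question. The index of the first row of each record is 0, and no later row in the same record has index 0. (There is no limit on the number of rows in each numeric table, and different records may have different numbers of rows.)
  • no packages are used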

    L <- sub("^ *Curve Color", "Curve Color", readLines("example.txt"))
    variables <- read.table(text = grep("^\\d", trimws(L), value = TRUE), 
     col.names = c("index", "variable.1", "variable.2"))
    rest <- trimws(read.dcf(textConnection(L))[, -2])
    cbind(rest[cumsum(variables$index == 0), ], variables)
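The last line relies on the index resetting to 0 at the start of every record. As a minimal sketch of that step, using made-up values (the vectors `index` and `meta` are assumptions for illustration, not output of the code above):

```r
# Hypothetical index column from two stacked tables of different lengths
index <- c(0, 1, 2, 0, 1)

# A new record starts wherever the index resets to 0
record_id <- cumsum(index == 0)
record_id

# Hypothetical per-record metadata, one row per record
meta <- data.frame(name  = c("Curve A", "Curve B"),
                   color = c("Blue", "Green"),
                   stringsAsFactors = FALSE)

# Indexing by record_id repeats each metadata row once per table row
expanded <- meta[record_id, ]
expanded$name
```

This is why the approach tolerates records of different lengths: the repeat counts come from the data rather than from a fixed number of rows per curve.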
    

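    A slightly different approach that assumes a predictable format. We take each "record", extract the salient components, and bind them together: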

    library(purrr)
    library(stringi)
    
    lines <- readLines("example.txt") # read the raw file, as in the question

    starts <- which(grepl("Curve Name:", lines)) # find the start of each record
    ends <- which(grepl("Curve Color:", lines))+1  # find the end of each record
    
    map2_df(starts, ends, function(start, end) {
    
      rec <- paste0(lines[start:(end)], collapse="\n") # extract the record
    
      # regex extract each set of values
      stri_match_first_regex(rec, c("Curve Name:[[:space:]]+([[:alnum:][:blank:]]+)",
                                    "Curve Values:[[:space:]]+([[:print:][:space:]]+)Curve",
                                    "Curve Color:[[:space:]]+([[:alnum:][:blank:]]+)"))[,2] %>%  
        trimws() -> found
    
        df <- read.table(text=found[2], skip=2, col.names=c("index", "variable.1", "variable.2"))
        df$curve.name <- found[1]
        df$color <- found[3]
        df
    
    })
    ##   index variable.1 variable.2 curve.name color
    ## 1     0         30        100    Curve A  Blue
    ## 2     1         40         95    Curve A  Blue
    ## 3     2         50         90    Curve A  Blue
    ## 4     0         30        100    Curve B Green
    ## 5     1         40         90    Curve B Green
    ## 6     2         50         80    Curve B Green
    
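    Comments:

      • Nice solution, @hrbrmstr. Why did you use trimws rather than stringi::stri_trim_both?
      • I'd put that down to typing tersely at 23:30 EST :-)
      • I upvoted all the answers; however, I chose this one because it handles variable-length curves without needing extra packages. The cumulative sum over the "Curve Values:" lines works for my problem.
      • Following the poster's comment that different records may have different numbers of rows in the numeric table, the code has been modified to allow for this. Some simplifications were also made, so the code is no longer than before.
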
    Output of the base R read.dcf code shown earlier:

      Curve Name Curve Color index variable.1 variable.2
    1    Curve A        Blue     0         30        100
    2    Curve A        Blue     1         40         95
    3    Curve A        Blue     2         50         90
    4    Curve B       Green     0         30        100
    5    Curve B       Green     1         40         90
    6    Curve B       Green     2         50         80
    