In R, how do I create a data frame from a text file of segmented data?

r import

In R, I am trying to import a huge text file with the following structure. Here is a sample saved as example.txt:

Curve Name: 
     Curve A
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          95
     2               50          90
 Curve Color:
     Blue 

Curve Name: 
     Curve B
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          90
     2               50          80
 Curve Color:
     Green

So far, I can extract the names and the colors:

file.text <- readLines("example.txt")

curve.names <- trimws(file.text[which(regexpr('Curve Name:', file.text) > 0) + 1])
curve.colors <- trimws(file.text[which(regexpr('Curve Color:', file.text) > 0) + 1])

Assuming every file has exactly the same format as the one above:

txt <- readLines("example.txt")
curve_name <- rep(trimws(txt[c(2,13)]), each=3)
curve_color <- rep(trimws(txt[c(10,21)]), each=3)
val <- read.table(text=paste(txt[c(6:8, 17:19)], collapse = "\n"))
colnames(val) <- c("index", "var1", "var2")
cbind(curve_name, curve_color, val)
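
The hard-coded positions above only cover two records. As a rough sketch, the same idea can be extended to any number of records by computing the positions from a fixed record length. This assumes every record occupies exactly 11 lines (10 content lines plus one blank separator) and has exactly 3 data rows; the names rec_len, n_rows, offsets and fixed_df are made up for this illustration.

```r
# Inline copy of the two-record sample (\u00b0 is the degree sign)
txt <- c("Curve Name: ", "     Curve A", "Curve Values:",
         "     index   Variable 1   Variable 2",
         "                   [\u00b0C]          [%]",
         "     0               30          100",
         "     1               40          95",
         "     2               50          90",
         " Curve Color:", "     Blue ", "",
         "Curve Name: ", "     Curve B", "Curve Values:",
         "     index   Variable 1   Variable 2",
         "                   [\u00b0C]          [%]",
         "     0               30          100",
         "     1               40          90",
         "     2               50          80",
         " Curve Color:", "     Green")

rec_len <- 11L   # assumption: 10 content lines + 1 blank separator per record
n_rows  <- 3L    # assumption: 3 data rows per record

# Line offset of each record: 0, 11, 22, ...
offsets <- seq(0L, length(txt) - 1L, by = rec_len)

name_idx  <- offsets + 2L                          # line after "Curve Name:"
color_idx <- offsets + 10L                         # line after "Curve Color:"
val_idx   <- as.vector(outer(6:8, offsets, `+`))   # data rows of every record

val <- read.table(text = paste(txt[val_idx], collapse = "\n"),
                  col.names = c("index", "var1", "var2"))

fixed_df <- cbind(curve_name  = rep(trimws(txt[name_idx]),  each = n_rows),
                  curve_color = rep(trimws(txt[color_idx]), each = n_rows),
                  val)
```

If the number of data rows varies between records, this fixed-offset approach breaks down and one of the pattern-based approaches is a better fit.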

This usually takes a lot of grep. It is also handy to find a way to group the entries, such as the cumulative sum of the blank lines:

l <- readLines(textConnection('Curve Name: 
     Curve A
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          95
     2               50          90
 Curve Color:
     Blue 

Curve Name: 
     Curve B
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          90
     2               50          80
 Curve Color:
     Green '))

do.call(rbind, 
        lapply(split(trimws(l), cumsum(l == '')), function(x){
            data.frame(
                curve = x[grep('Curve Name:', x) + 1], 
                read.table(text = paste(x[(grep('index', x) + 2):(grep('Curve Color:', x) - 1)], 
                                        collapse = '\n'), 
                           col.names = c('index', 'variable.1', 'variable.2')))}))
##       curve index variable.1 variable.2
## 0.1 Curve A     0         30        100
## 0.2 Curve A     1         40         95
## 0.3 Curve A     2         50         90
## 1.1 Curve B     0         30        100
## 1.2 Curve B     1         40         90
## 1.3 Curve B     2         50         80
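
To see why the cumulative sum of blank lines works as a grouping key, here is a toy sketch on a made-up vector (the values a1, b1 and so on are placeholders, not data from the question):

```r
# Toy input: three blocks of lines separated by empty strings
l <- c("a1", "a2", "", "b1", "b2", "b3", "", "c1")

# Each blank line bumps the counter, so every line gets the id of its block
grp <- cumsum(l == "")

# split() returns a list named "0", "1", "2"; every block after the first
# still starts with its blank separator line, so drop empty strings afterwards
groups <- lapply(split(l, grp), function(x) x[x != ""])
```

The list names are also where row names such as 0.1 and 1.1 come from when the per-record data frames are combined with rbind.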

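Read the lines into L and remove any whitespace before Curve Color. read.dcf treats a line that starts with whitespace as a continuation of the previous field, so the leading space in the sample would otherwise hide that field. (Removing it may be unnecessary if the actual file has no whitespace before Curve Color, but the sample in the question does.) Then re-read the lines that start with a digit to create the variables data.frame. Next, read the rest with read.dcf and use cbind to put the two together.
We assume that:
Curve Values comes second, so we can use [, -2].
Only the rows of the numeric tables start with a digit (possibly preceded by whitespace).
Each numeric record has 3 columns, with column names as shown in the question. The index of the first row of each record is 0, and the later rows of the same record have non-zero indexes. (There is no limit on the number of rows in each numeric table, and different records may have different numbers of rows.)
No packages are used.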
L <- sub("^ *Curve Color", "Curve Color", readLines("example.txt"))
variables <- read.table(text = grep("^\\d", trimws(L), value = TRUE), 
 col.names = c("index", "variable.1", "variable.2"))
rest <- trimws(read.dcf(textConnection(L))[, -2])
cbind(rest[cumsum(variables$index == 0), ], variables)
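
The last line of the code above is what copes with records of different lengths. A small sketch with hand-made inputs may make the indexing clearer; the data below are hypothetical stand-ins for what read.table and read.dcf would return, with Curve B shortened to two rows.

```r
# Hypothetical parsed pieces: Curve A has 3 rows, Curve B only 2
variables <- data.frame(index      = c(0, 1, 2, 0, 1),
                        variable.1 = c(30, 40, 50, 30, 40),
                        variable.2 = c(100, 95, 90, 100, 90))

# One row of metadata per record, as a character matrix
rest <- cbind("Curve Name"  = c("Curve A", "Curve B"),
              "Curve Color" = c("Blue", "Green"))

# Every index of 0 starts a new record, so the running count is the record number
rec <- cumsum(variables$index == 0)

# Repeat each metadata row once per data row, then bind the columns
out <- cbind(rest[rec, ], variables)
```

Here rec is 1 1 1 2 2, so the first metadata row is used three times and the second twice.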

A slightly different approach that assumes a predictable format. We take each "record", extract the salient components and bind them together:
library(purrr)
library(stringi)

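lines <- readLines("example.txt") # read the raw file (assumed; `lines` was not defined in the original)
starts <- which(grepl("Curve Name:", lines)) # find the start of each record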
ends <- which(grepl("Curve Color:", lines))+1  # find the end of each record

map2_df(starts, ends, function(start, end) {

  rec <- paste0(lines[start:(end)], collapse="\n") # extract the record

  # regex extract each set of values
  stri_match_first_regex(rec, c("Curve Name:[[:space:]]+([[:alnum:][:blank:]]+)",
                                "Curve Values:[[:space:]]+([[:print:][:space:]]+)Curve",
                                "Curve Color:[[:space:]]+([[:alnum:][:blank:]]+)"))[,2] %>%  
    trimws() -> found

    df <- read.table(text=found[2], skip=2, col.names=c("index", "variable.1", "variable.2"))
    df$curve.name <- found[1]
    df$color <- found[3]
    df

})
##   index variable.1 variable.2 curve.name color
## 1     0         30        100    Curve A  Blue
## 2     1         40         95    Curve A  Blue
## 3     2         50         90    Curve A  Blue
## 4     0         30        100    Curve B Green
## 5     1         40         90    Curve B Green
## 6     2         50         80    Curve B Green
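
For comparison, the same start/end idea can be sketched in base R only. This is not the code from the answer above: it swaps the regex extraction for positional lookups and assumes the name is on the line after "Curve Name:", the color on the line after "Curve Color:", and that only data rows start with a digit. The sample is inlined, with Curve B shortened to two rows to show that variable lengths are handled.

```r
# Inline sample; Curve B has only two data rows (\u00b0 is the degree sign)
lines <- c("Curve Name: ", "     Curve A", "Curve Values:",
           "     index   Variable 1   Variable 2",
           "                   [\u00b0C]          [%]",
           "     0               30          100",
           "     1               40          95",
           "     2               50          90",
           " Curve Color:", "     Blue ", "",
           "Curve Name: ", "     Curve B", "Curve Values:",
           "     index   Variable 1   Variable 2",
           "                   [\u00b0C]          [%]",
           "     0               30          100",
           "     1               40          90",
           " Curve Color:", "     Green")

starts <- grep("Curve Name:", lines)        # first line of each record
ends   <- grep("Curve Color:", lines) + 1   # line holding the color value

records <- Map(function(start, end) {
  rec  <- trimws(lines[start:end])          # one record, whitespace stripped
  rows <- rec[grepl("^[0-9]", rec)]         # keep only the numeric rows
  df <- read.table(text = paste(rows, collapse = "\n"),
                   col.names = c("index", "variable.1", "variable.2"))
  df$curve.name <- rec[2]                   # line after "Curve Name:"
  df$color      <- rec[length(rec)]         # line after "Curve Color:"
  df
}, starts, ends)

result <- do.call(rbind, records)
```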

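Nice solution @hrbrmstr. Why do you use trimws rather than stringi::stri_trim_both?
I'll blame the brevity that comes with typing at 23:30 EST :-)
I upvoted all the answers; however, this one was selected because it handles variable-length curves without requiring additional packages. The cumulative sum of the "Curve Values:" lines works for my problem.
Following the poster's comment that different records may have different numbers of rows in the numeric table, the code has been modified to allow for this. Some simplifications were also made, so the code is no longer than before.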
With the sample file, the read.dcf code gives:

  Curve Name Curve Color index variable.1 variable.2
1    Curve A        Blue     0         30        100
2    Curve A        Blue     1         40         95
3    Curve A        Blue     2         50         90
4    Curve B       Green     0         30        100
5    Curve B       Green     1         40         90
6    Curve B       Green     2         50         80
