In R, how do I create a data frame from a text file of segmented data?
In R, I am trying to import a massive text file with the following structure. Here is an example, saved as example.txt:
Curve Name:
Curve A
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 95
2 50 90
Curve Color:
Blue

Curve Name:
Curve B
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 90
2 50 80
Curve Color:
Green
So far, I can extract the names and colors:
file.text <- readLines("example.txt")
curve.names <- trimws(file.text[which(regexpr('Curve Name:', file.text) > 0) + 1])
curve.colors <- trimws(file.text[which(regexpr('Curve Color:', file.text) > 0) + 1])
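Assuming every file has exactly the same format as the one above (the hard-coded line numbers below count one blank line between the two records):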
txt <- readLines("example.txt")
curve_name <- rep(trimws(txt[c(2,13)]), each=3)
curve_color <- rep(trimws(txt[c(10,21)]), each=3)
val <- read.table(text=paste(txt[c(6:8, 17:19)], collapse = "\n"))
colnames(val) <- c("index", "var1", "var2")
cbind(curve_name, curve_color, val)
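This kind of task usually takes a lot of grep. It also helps to find a way to group the entries, such as a cumulative sum over the blank lines: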
l <- readLines(textConnection('Curve Name:
Curve A
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 95
2 50 90
Curve Color:
Blue

Curve Name:
Curve B
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 90
2 50 80
Curve Color:
Green '))
do.call(rbind,
        lapply(split(trimws(l), cumsum(l == '')), function(x){
            data.frame(
                curve = x[grep('Curve Name:', x) + 1],
                read.table(text = paste(x[(grep('index', x) + 2):(grep('Curve Color:', x) - 1)],
                                        collapse = '\n'),
                           col.names = c('index', 'variable.1', 'variable.2')))}))
##       curve index variable.1 variable.2
## 0.1 Curve A     0         30        100
## 0.2 Curve A     1         40         95
## 0.3 Curve A     2         50         90
## 1.1 Curve B     0         30        100
## 1.2 Curve B     1         40         90
## 1.3 Curve B     2         50         80
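If the real file has no blank lines between records, the same split-and-bind idea can be keyed on the header line instead. The following is a minimal sketch, not part of the answer above. It assumes that every record starts with a line reading Curve Name: and that the table rows are the only lines beginning with a digit. The names parse_curves and txt, and the shortened second record, are made up for illustration.

```r
# Hypothetical sample: two records of different lengths, no blank lines
txt <- c("Curve Name:", " Curve A", "Curve Values:",
         " index Variable 1 Variable 2", " [°C] [%]",
         " 0 30 100", " 1 40 95", " 2 50 90",
         "Curve Color:", " Blue",
         "Curve Name:", " Curve B", "Curve Values:",
         " index Variable 1 Variable 2", " [°C] [%]",
         " 0 30 100", " 1 40 90",
         "Curve Color:", " Green")

parse_curves <- function(lines) {
  lines <- trimws(lines)
  # The counter increases by one at every "Curve Name:" line,
  # so all lines of one record share the same id
  record_id <- cumsum(lines == "Curve Name:")
  pieces <- lapply(split(lines, record_id), function(x) {
    # Table rows are assumed to be the only lines starting with a digit
    values <- read.table(text = grep("^[0-9]", x, value = TRUE),
                         col.names = c("index", "variable.1", "variable.2"))
    # The name and color sit on the line after their labels
    data.frame(curve = x[which(x == "Curve Name:") + 1],
               color = x[which(x == "Curve Color:") + 1],
               values)
  })
  do.call(rbind, unname(pieces))
}

curves <- parse_curves(txt)
print(curves)
```

Because the grouping does not depend on fixed line numbers, records with different numbers of table rows are handled without any change.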
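Read the lines into L and remove any whitespace before Curve Color. (This may not be needed if the actual file has no whitespace before Curve Color, but the example in the question does.) Then re-read the lines that start with a digit to create the variables data.frame. Then read the rest with read.dcf and use cbind to put the two together.
We assume that:
Curve Values comes second, so that we can drop it with [, -2]
the only lines that start with a digit (after leading whitespace) are the rows of the numeric tables
each numeric record has 3 columns, with column names as shown in the question. The index of the first row of each record is 0, and the indexes of the subsequent rows in the same record are not 0, so cumsum(variables$index == 0) gives the record number of every table row. (There is no restriction on the number of rows in each numeric table, and different records may have different numbers of rows.)
No packages are used.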
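L <- sub("^ *Curve Color", "Curve Color", readLines("example.txt"))
variables <- read.table(text = grep("^\\d", trimws(L), value = TRUE),
                        col.names = c("index", "variable.1", "variable.2"))
rest <- trimws(read.dcf(textConnection(L))[, -2])
cbind(rest[cumsum(variables$index == 0), ], variables)
##   Curve Name Curve Color index variable.1 variable.2
## 1    Curve A        Blue     0         30        100
## 2    Curve A        Blue     1         40         95
## 3    Curve A        Blue     2         50         90
## 4    Curve B       Green     0         30        100
## 5    Curve B       Green     1         40         90
## 6    Curve B       Green     2         50         80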
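To see why the read.dcf step works, it may help to look at what it returns before the Curve Values column is dropped. This is a small sketch under an assumption: the indentation of the value lines in L below is a guess at the layout of the real file. read.dcf treats indented lines as continuations of the preceding field and blank lines as record separators.

```r
# Hypothetical indented copy of the sample file, one element per line
L <- c("Curve Name:", "  Curve A", "Curve Values:",
       "  index Variable 1 Variable 2", "  [°C] [%]",
       "  0 30 100", "  1 40 95", "  2 50 90",
       "Curve Color:", "  Blue",
       "",
       "Curve Name:", "  Curve B", "Curve Values:",
       "  index Variable 1 Variable 2", "  [°C] [%]",
       "  0 30 100", "  1 40 90", "  2 50 80",
       "Curve Color:", "  Green")

# One row per record, one column per "Field:" label
fields <- read.dcf(textConnection(L))
print(colnames(fields))
print(dim(fields))

# Keep the two single-valued fields by name instead of by position
rest <- trimws(fields[, c("Curve Name", "Curve Color"), drop = FALSE])
print(rest)
```

Selecting the columns by name rather than with [, -2] is a design choice of this sketch: it does not rely on Curve Values being the second field.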
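A slightly different approach that assumes a predictable format. We take each "record", extract the salient components, and bind them together: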
library(purrr)
library(stringi)

lines <- readLines("example.txt")                # read the file, one element per line

starts <- which(grepl("Curve Name:", lines))     # find the start of each record
ends <- which(grepl("Curve Color:", lines))+1    # find the end of each record

map2_df(starts, ends, function(start, end) {

  rec <- paste0(lines[start:(end)], collapse="\n")  # extract the record

  # regex extract each set of values
  stri_match_first_regex(rec, c("Curve Name:[[:space:]]+([[:alnum:][:blank:]]+)",
                                "Curve Values:[[:space:]]+([[:print:][:space:]]+)Curve",
                                "Curve Color:[[:space:]]+([[:alnum:][:blank:]]+)"))[,2] %>%
    trimws() -> found

  # skip the header and units lines, then read the numeric table
  df <- read.table(text=found[2], skip=2, col.names=c("index", "variable.1", "variable.2"))
  df$curve.name <- found[1]
  df$color <- found[3]
  df

})
## index variable.1 variable.2 curve.name color
## 1 0 30 100 Curve A Blue
## 2 1 40 95 Curve A Blue
## 3 2 50 90 Curve A Blue
## 4 0 30 100 Curve B Green
## 5 1 40 90 Curve B Green
## 6 2 50 80 Curve B Green
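The field extraction can also be sketched in base R with regexec and regmatches, for readers who would rather not depend on stringi. This is only an illustration, not the answerer's code: extract_field and rec are hypothetical names, and the pattern assumes each value sits on the line after its label.

```r
# One record collapsed into a single string, as in the approach above
rec <- paste(c("Curve Name:", " Curve A",
               "Curve Values:", " index Variable 1 Variable 2", " [°C] [%]",
               " 0 30 100", " 1 40 95", " 2 50 90",
               "Curve Color:", " Blue"), collapse = "\n")

extract_field <- function(record, label) {
  # Capture the run of letters, digits and spaces that follows the label
  pattern <- paste0(label, ":[[:space:]]+([[:alnum:] ]+)")
  hit <- regmatches(record, regexec(pattern, record))[[1]]
  trimws(hit[2])   # hit[1] is the whole match, hit[2] the captured group
}

name  <- extract_field(rec, "Curve Name")
color <- extract_field(rec, "Curve Color")

# Table rows: lines whose first non-blank character is a digit
rows <- grep("^[[:space:]]*[0-9]", strsplit(rec, "\n")[[1]], value = TRUE)
values <- read.table(text = rows,
                     col.names = c("index", "variable.1", "variable.2"))

curve_df <- data.frame(values, curve.name = name, color = color)
print(curve_df)
```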
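Nice solution, @hrbrmstr. Why did you use trimws instead of stringi::stri_trim_both?
I'll blame that on the terseness that comes from typing at 23:30 EST :-)
I upvoted all the answers; however, this one was selected because it handles variable-length curves without requiring additional packages. The cumulative sum of the "Curve Values:" lines works for my problem.
Based on the poster's comment that different records may have different numbers of rows in the numeric table, the code has been modified to allow this. Some simplifications were also made, so the code is no longer than before.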