Importing irregular data in R

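I'm hoping someone can help me with a data import problem. I think there may be a simple fix, but I haven't found the answer yet. I have a large number of txt files containing antenna scans, and I need to import them all in a uniform layout. The problem is that each file contains an irregular number of lines of diagnostic data about the antenna before the actual data begins. I need a function that can recognize where the actual data starts, so that I can import it with the right values in the right columns. Basically, for each file I need to determine the number of diagnostic lines so that I can specify skip="" when reading the file in with read.delim or similar.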

Here is an example of one of the files I'm talking about:

Power OFF @ 12:05:50 02/15/13 
Power ON  @ 12:06:03 02/15/13 
Reader #1 12:06:03 02/15/13 

Reader #2 12:06:03 02/15/13 

Battery Voltage = 13.35 @ 13:00:00 02/15/13 
Battery Voltage = 13.42 @ 14:00:00 02/15/13 
Battery Voltage = 13.32 @ 15:00:00 02/15/13 
Battery Voltage = 13.55 @ 16:00:00 02/15/13 

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Reader #2 02:57:40 02/17/13 LA 900 226000012999

Are you always looking for the first line after the last line containing "Battery Voltage"? If so, try this:

the.file <- readLines("C:\\Users\\myfile.txt")
row.to.begin.skip.at <- tail(grep("Battery Voltage", the.file), 1)

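You can read the file in as a block of text and use grep to work out which lines to drop. Here I have stored your text block in test.txt. Assuming your header always runs up to the Battery Voltage section, you can first identify the line numbers that contain Battery and then take the last one. That is the number of lines to skip.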
con = file('test.txt', 'r')
text = readLines(con)
close(con)

lines_to_skip = max(grep('Battery',text))    

Then you should be able to read the data in cleanly:
> x = read.table('test.txt', skip=lines_to_skip, sep=' ', comment.char='')
> x
  V1     V2       V3       V4 V5  V6       V7
1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
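Since the question mentions a large number of files, the two steps above could be wrapped in a function and applied to every file in a folder. The following is only a sketch: read_after_marker, the temporary demo files, and the choice to read every column as character (so the tag number is not displayed as 2.26e+11) are assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: read one file, skipping everything up to and
# including the last line that contains `marker`.
read_after_marker <- function(path, marker = "Battery") {
  lines <- readLines(path)
  hits <- grep(marker, lines, fixed = TRUE)
  # If the marker never appears, fall back to skipping nothing.
  skip_n <- if (length(hits) > 0) max(hits) else 0L
  read.table(path, skip = skip_n, sep = " ", comment.char = "",
             colClasses = "character")
}

# Demo: two small temporary files standing in for real antenna scans.
demo_dir <- tempfile("antenna")
dir.create(demo_dir)
for (i in 1:2) {
  writeLines(c("Power ON  @ 12:06:03 02/15/13 ",
               "Battery Voltage = 13.35 @ 13:00:00 02/15/13 ",
               "",
               "Reader #2 02:57:40 02/17/13 LA 900 226000012999"),
             file.path(demo_dir, paste0("scan", i, ".txt")))
}

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
all_scans <- do.call(rbind, lapply(files, read_after_marker))
```

This still depends on the marker being the last diagnostic line in every file, so it is worth checking the row counts on a few files by hand.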

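A slight variation. This returns every line that starts with Reader and contains 7 space-separated elements. I noticed that the first two Reader lines are shorter; if that is not always the case, then of course this won't work.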
antenna0 <- readLines("antenna.txt")
antenna0 <- antenna0[grep("^Reader", antenna0)]
antenna <- strsplit(antenna0, " ")
data.frame(do.call(rbind, antenna[sapply(antenna, length) == 7]))

#      X1 X2       X3       X4 X5  X6           X7
#1 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#2 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#3 Reader #2 02:57:40 02/17/13 LA 900 226000012999
#4 Reader #2 02:57:40 02/17/13 LA 900 226000012999

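Brief explainer:
[\\s{4,}]
is meant to return any string containing four or more ({4,}) whitespace characters (\\s)
^Reader
means any string that starts with the letter sequence Reader will be returned
*
was used to join the two patterns so that they act together like an AND operator.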
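As an alternative to combining several small patterns, one regular expression can describe a complete data line. The sketch below is an assumption based only on the sample file (a reader number, a time, a date, a short uppercase code, and two numbers); the name data_pattern is made up for illustration and real files may need a looser pattern.

```r
# Three lines copied from the sample file: a power message, a short
# Reader line from the header, and a full data line.
sample_lines <- c(
  "Power ON  @ 12:06:03 02/15/13 ",
  "Reader #1 12:06:03 02/15/13 ",
  "Reader #2 02:57:40 02/17/13 LA 900 226000012999"
)

# Hypothetical pattern for a full data line:
# Reader #N, HH:MM:SS, MM/DD/YY, uppercase code, number, tag number.
data_pattern <- paste0(
  "^Reader #[0-9]+ ",
  "[0-9]{2}:[0-9]{2}:[0-9]{2} ",
  "[0-9]{2}/[0-9]{2}/[0-9]{2} ",
  "[A-Z]+ [0-9]+ [0-9]+ *$"
)

is_data <- grepl(data_pattern, sample_lines)
data_lines <- sample_lines[is_data]
```

Only the third line matches, because the short header line stops after the date.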
read.table
If you read the text in line by line with readLines, you can use grep to find the highest line number that matches "Battery Voltage" and pass it to skip:
# file.txt is assumed to be a character variable holding the path to the file
read.table(file.txt, 
           skip = max(grep('Battery Voltage', readLines(file.txt))), 
           # set comment delimiting character to anything besides "#"
           comment.char = '')
##       V1 V2       V3       V4 V5  V6       V7
## 1 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 3 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11

Note that further cleaning is needed (merging columns, formatting dates).
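The cleanup mentioned above might look something like the following. This is a sketch under assumptions: x stands in for the result of read.table with the tag column kept as character, the output column names are invented, and the UTC time zone is a guess, since the files do not state one.

```r
# Stand-in for one row of the data frame returned by read.table,
# with V7 kept as character so the tag is not shown as 2.26e+11.
x <- data.frame(
  V1 = "Reader", V2 = "#2", V3 = "02:57:40", V4 = "02/17/13",
  V5 = "LA", V6 = 900L, V7 = "226000012999",
  stringsAsFactors = FALSE
)

clean <- data.frame(
  # Merge "Reader" and "#2" back into a single column.
  reader = paste(x$V1, x$V2),
  # Merge time and date, then parse them as a date-time.
  datetime = as.POSIXct(paste(x$V3, x$V4),
                        format = "%H:%M:%S %m/%d/%y", tz = "UTC"),
  V5 = x$V5,
  V6 = x$V6,
  tag = x$V7,
  stringsAsFactors = FALSE
)
```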

read.fwf
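If the column widths are consistent, read.fwf (fixed-width file) may make more sense. You will need na.omit, complete.cases, or some other way of dropping blank lines, because read.fwf does not skip them. skip and the other arguments work as in read.table and its variants: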
na.omit(read.fwf(file.txt, 
                 widths = c(9, -1, 17, -1, 2, -1, 3, -1, 12), 
                 skip = max(grep('Battery Voltage', readLines(file.txt))), 
                 comment.char = ''))
##          V1                V2 V3  V4       V5
## 2 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 4 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 6 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11
## 8 Reader #2 02:57:40 02/17/13 LA 900 2.26e+11

However, counting characters to work out the column widths is a pain (and error-prone).
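One way to avoid counting by hand is to let nchar measure the fields of a sample line. In this sketch the grouping of the line into five fields is typed in manually and is an assumption about the desired layout; the variable names are made up for illustration.

```r
line <- "Reader #2 02:57:40 02/17/13 LA 900 226000012999"

# The five fields as they should appear in the result.
fields <- c("Reader #2", "02:57:40 02/17/13", "LA", "900", "226000012999")
widths <- nchar(fields)

# Put a -1 (skip one separator character) between consecutive fields,
# then drop the trailing -1 after the last field.
widths_with_gaps <- head(as.vector(rbind(widths, -1L)), -1)

# Sanity check: the widths should account for every character in the line.
stopifnot(sum(abs(widths_with_gaps)) == nchar(line))
```

The result is the same vector that was passed to widths in the read.fwf call above.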

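readr::read_fwf
The readr package makes working with fixed-width files slightly less annoying and gives useful warnings when the parsing goes wrong. It also provides arguments for parsing dates and date-times as the data is read in, which is very handy: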
library(readr)

df <- read_fwf(file.txt, 
               fwf_widths(c(9, 18, 3, 4, NA)), 
               col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'),'c', 'i', 'd'), 
               skip = max(grep('Battery Voltage', readLines(file.txt))))

df <- df[complete.cases(df), ]
# or df <- na.omit(df)
# or if some NAs are possible, more robust:
# df <- df[colSums(!apply(df, 1, is.na)) > 0, ]

df
## # A tibble: 4 x 5
##          X1                  X2    X3    X4       X5
##       <chr>              <time> <chr> <int>    <dbl>
## 1 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11

This approach is a little fragile, though, so only use it if you can verify that it is working correctly.
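So what data do you need from that file?

The data that starts after the battery voltage readings, i.e. reader, time, date, tag number.

Thanks, that does work in this case. But I have thousands of these files, and the diagnostics are not always just battery voltage readings (that just happened to be how the file I pasted starts). There are lots of diagnostic readings, so I was hoping to find something that recognizes the pattern of the data I want to start from. Thanks for your input, I really appreciate it.

In that case you can use the same approach, but instead of using grep on what you don't want, use it to find what you do want. For example, you are interested in the lines with Reader because they contain the data you want, but in your example there appear to be some lines in the header that start with Reader as well. To get rid of those, you can use strsplit on each line and count the number of elements. If a line has 7 elements, you know it is your main data. If it only has 4, it is junk, so don't start your read.table there yet.

Thanks for your input. In my case I'm really trying to work out something that can recognize the pattern of the data I want, because the various files contain many diagnostic readings besides battery voltage.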
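The suggestion in the comments (look for what you do want, then count elements with strsplit) could be sketched as follows. The helper names is_data_line and count_skip are hypothetical, the sample lines are copied from the question, and the sketch assumes that no diagnostic lines appear again once the data has started.

```r
# Sample lines standing in for one antenna file.
sample_lines <- c(
  "Power OFF @ 12:05:50 02/15/13 ",
  "Power ON  @ 12:06:03 02/15/13 ",
  "Reader #1 12:06:03 02/15/13 ",
  "",
  "Reader #2 12:06:03 02/15/13 ",
  "",
  "Battery Voltage = 13.35 @ 13:00:00 02/15/13 ",
  "",
  "Reader #2 02:57:40 02/17/13 LA 900 226000012999",
  "",
  "Reader #2 02:57:40 02/17/13 LA 900 226000012999"
)

# Hypothetical helper: TRUE for lines that start with "Reader" and split
# into exactly n_fields space-separated elements.
is_data_line <- function(lines, n_fields = 7) {
  n <- lengths(strsplit(lines, " ", fixed = TRUE))
  grepl("^Reader", lines) & n == n_fields
}

# Hypothetical helper: number of lines before the first data line.
count_skip <- function(lines, n_fields = 7) {
  first <- which(is_data_line(lines, n_fields))[1]
  if (is.na(first)) stop("no data line found")
  first - 1L
}

skip_n <- count_skip(sample_lines)

# Write the sample to a temporary file so read.table can use skip.
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)
dat <- read.table(tmp, skip = skip_n, sep = " ", comment.char = "",
                  colClasses = "character")
```

If diagnostic lines can also turn up in the middle of the data, filtering the lines with is_data_line before parsing would probably be safer than relying on skip alone.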

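# Variant without skip: read the whole file and let na.omit drop the rows
# (diagnostic and blank lines) that fail to parse
na.omit(read_fwf(file.txt, 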
                 fwf_widths(c(9, 18, 3, 4, 13)), 
                 col_types = list('c', col_datetime('%H:%M:%S %m/%d/%y'),'c', 'i', 'd')))
## # A tibble: 4 x 5
##          X1                  X2    X3    X4       X5
##       <chr>              <time> <chr> <int>    <dbl>
## 1 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 2 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 3 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11
## 4 Reader #2 2013-02-17 02:57:40    LA   900 2.26e+11