Any way to reshape plain-text data into regular tabular data using dplyr utilities?


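I have gridded data in plain-text ASCII format. All observations are at the daily level, and each year's data is collected on the grid. I want to restructure these data so that I can compute annual statistics. To do that, I need to rebuild the plain-text data as a matrix, like tabular data, in which each day's observations go into a new column. That would make the annual mean much easier to compute.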

Update:

Because the raw plain-text data is quite large, I only show a short excerpt of it here.

Update 2:

I imported the raw ASCII data into R with the following script:

rawdata = read.table(file = "~/25_krig_all_1980", header = FALSE, fill = TRUE, comment.char="Y", stringsAsFactors=FALSE )
colnames(rawdata) = c("long", "lat", "precip", "err1", "err2")

Here is what the skeleton of the raw plain-text data looks like in notepad++:

1980   1   1   1
      6.125 47.375     0.0    20.00     1.0
      6.375 47.375     0.0    19.99     1.0
      6.625 47.375     0.0    19.97     1.0
      6.875 47.375     0.0    19.84     1.0
      7.125 47.375     0.0    20.00     1.0
 1980   1   2   2
      6.125 47.375     1.5    20.00     1.0
      6.375 47.375     1.5    19.99     1.0
      6.625 47.375     1.5    19.97     1.0
      6.875 47.375     1.5    19.84     1.0
      7.125 47.375     2.9    20.00     1.0
 1980   1   3   3
      6.125 47.375     3.3    20.00     1.0
      6.375 47.375     3.3    19.99     1.0
      6.625 47.375     3.3    19.97     1.0
      6.875 47.375     3.3    19.84     1.0
      7.125 47.375     1.3    20.00     1.0
 1980   1   4   4
      6.125 47.375     3.8    20.00     1.0
      6.375 47.375     3.8    19.99     1.0
      6.625 47.375     3.8    19.97     1.0
      6.875 47.375     3.7    19.84     1.0
      7.125 47.375     3.7    20.00     1.0
 1980   1   5   5
      6.125 47.375     2.2    20.00     1.0
      6.375 47.375     2.2    19.99     1.0
      6.625 47.375     2.2    19.97     1.0
      6.875 47.375     2.2    19.84     1.0
      7.125 47.375     4.8    20.00     1.0

Here is a minimal example of how I read one block of the raw plain-text data:

foo = read.table("grid_data_demo.txt", header=FALSE, skip=1, nrows = 5)
colnames(foo) = c("long", "lat", "precip", "err1", "err2")

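Update 3:

The raw plain-text data has no text delimiter, and the file is not laid out as a list of all the data. I created miniDat as a reproducible example, because I want to get a list-like object from the raw data.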

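So I want to rebuild a table-like matrix from the raw plain-text data and compute annual statistics for each grid point separately. Perhaps dplyr or data.table provides utilities for this kind of operation. Is there a quick solution for this data transformation? How can I do this easily with dplyr utilities? Any ideas?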

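Desired output:

In my expected output, I want to drop the fourth column (err1) and the fifth column (err2), keep the long and lat columns with the same dimensions, and add the corresponding daily precip values as new columns. Here is a reproducible example of my expected output: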

desired_output = data.frame(
    long=c( 6.125 ,6.375, 6.625, 6.875, 7.125),
    lat=c(47.375, 47.375, 47.375, 47.375, 47.375),
    precip_day1=c(0, 0, 0, 0, 0),
    precip_day2=c(1.5, 1.5, 1.5, 1.5, 2.9),
    precip_day3=c(3.3, 3.3, 3.3, 3.3, 1.3),
    precip_day4=c(3.8, 3.8, 3.8, 3.7, 3.7),
    precip_day5=c(2.2, 2.2, 2.2, 2.2, 4.8)
)

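Basically, I want to simplify the raw data and rebuild it as a table-like matrix so that it is easier to compute the annual mean precip for each grid coordinate. For simplicity and efficiency, my final output should contain only the long, lat and annual_mn_precip columns.

How can I do this data simplification and transformation in R? Is there an easier way? Thanks.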

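You can use readLines() to read the raw text file into a vector with one element per line. You can then determine which lines contain dates and which contain observations (based on the indentation, in this case), read them into separate data frames, and combine the data frames based on the indices of the lines that contain dates. Here is code that does this: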

parse_weather <- function(file) {
  lines <- readLines(file)

  # Indicators for whether a line contains a date or an observation
  date_lines <- !startsWith(lines, " ")
  data_lines <- !date_lines

  # Number of observations for each date
  nobs <- diff(c(which(date_lines), length(lines) + 1)) - 1

  dates <- read.table(
    # repeat date for each observation
    text = paste(rep(lines[date_lines], nobs), collapse = "\n"),
    col.names = c("year", "month", "day", "days")
  )

  observations <- read.table(
    text = paste(lines[data_lines], collapse = "\n"),
    col.names = c("long", "lat", "precip", "err1", "err2")
  )

  cbind(dates, observations)
}

# I saved the example data snippet as a local text file
weather <- parse_weather("weather.txt")
head(weather, 8)
#>   year month day days  long    lat precip  err1 err2
#> 1 1980     1   1    1 6.125 47.375    0.0 20.00    1
#> 2 1980     1   1    1 6.375 47.375    0.0 19.99    1
#> 3 1980     1   1    1 6.625 47.375    0.0 19.97    1
#> 4 1980     1   1    1 6.875 47.375    0.0 19.84    1
#> 5 1980     1   1    1 7.125 47.375    0.0 20.00    1
#> 6 1980     1   2    2 6.125 47.375    1.5 20.00    1
#> 7 1980     1   2    2 6.375 47.375    1.5 19.99    1
#> 8 1980     1   2    2 6.625 47.375    1.5 19.97    1
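
The nobs line is the least obvious step in parse_weather(), so here is a small sketch of what it computes. The toy lines vector below is made up for illustration and is not part of the original data; it only assumes, as the function does, that date lines start in the first column and observation lines are indented.

```r
# Toy input (hypothetical): two date lines, followed by 3 and 2 observation lines.
lines <- c("1980 1 1 1", " a", " b", " c",
           "1980 1 2 2", " d", " e")

date_lines <- !startsWith(lines, " ")

# Positions of the date lines, plus a sentinel one past the last line.
starts <- c(which(date_lines), length(lines) + 1)

# The gap between consecutive starts, minus the date line itself,
# is the number of observations that belong to each date.
nobs <- diff(starts) - 1

# Repeating each date line nobs times gives one date per observation line,
# which is why the two data frames can be combined with cbind().
repeated_dates <- rep(lines[date_lines], nobs)
```

Because the counts are derived from the positions of the date lines, the number of observations per day is allowed to vary.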


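It is probably easier to work with these data in the long format that this import strategy leaves them in. However, if you want one column per day, you can reshape the data with, for example, tidyr::spread or reshape2::dcast.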
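
As a rough sketch of both routes, the example below uses only base R (reshape() and aggregate()) so that it runs without extra packages; it is a stand-in for the tidyr or reshape2 calls mentioned above, not a reproduction of them. The small weather data frame is a hand-made subset with the same column names that parse_weather() returns, and annual_mn_precip is the column name taken from the question.

```r
# Hypothetical long-format data: 3 grid points, 2 days.
weather <- data.frame(
  days   = rep(1:2, each = 3),
  long   = rep(c(6.125, 6.375, 6.625), times = 2),
  lat    = 47.375,
  precip = c(0, 0, 0, 1.5, 1.5, 2.9)
)

# Wide format: one row per grid point, one precip column per day.
wide <- reshape(
  weather,
  idvar     = c("long", "lat"),
  timevar   = "days",
  v.names   = "precip",
  direction = "wide",
  sep       = "_day"   # yields precip_day1, precip_day2, ...
)

# Annual mean per grid point, computed directly from the long format.
annual <- aggregate(precip ~ long + lat, data = weather, FUN = mean)
names(annual)[names(annual) == "precip"] <- "annual_mn_precip"
```

Note that the annual mean does not need the wide table at all, which is one reason the long format is convenient here.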

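Edit: It turns out that read.table() is very slow when its text argument is given a large vector. Pasting the lines vector into a single string speeds up the processing of large files considerably; I have updated the answer accordingly.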
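
The two ways of passing text can be compared on a tiny input. This sketch only shows that both forms parse to the same data frame; the three sample rows are copied from the question, and the speed difference described above is not re-measured here.

```r
lines <- c("6.125 47.375 0.0 20.00 1.0",
           "6.375 47.375 0.0 19.99 1.0",
           "6.625 47.375 0.0 19.97 1.0")
cols <- c("long", "lat", "precip", "err1", "err2")

# One vector element per line.
from_vector <- read.table(text = lines, col.names = cols)

# A single string with embedded newlines, as parse_weather() builds it.
from_string <- read.table(text = paste(lines, collapse = "\n"), col.names = cols)

same_result <- isTRUE(all.equal(from_vector, from_string))
```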

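The salient features of this problem are:

There is one header record per day, followed by a variable number of observation (detail) records for that day

The detail records do not include a key that links them back to their header

Header records have 4 columns; detail records have 5 columns

Because a longitude coordinate can have up to 3 digits to the left of the decimal point, we cannot rely on blanks in the first column to distinguish header records from detail records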
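
One way to act on the column-count difference is sketched below with utils::count.fields(), which reports the number of whitespace-separated fields on each line. This is an illustration under the assumption that every header has exactly 4 fields and every detail record exactly 5; the sample lines are taken from the question.

```r
lines <- c("1980   1   1   1",
           "      6.125 47.375     0.0    20.00     1.0",
           "      6.375 47.375     0.0    19.99     1.0",
           "1980   1   2   2",
           "      6.125 47.375     1.5    20.00     1.0")

# count.fields() reads from a file or connection, so wrap the vector.
con <- textConnection(lines)
n_fields <- count.fields(con)
close(con)

is_header <- n_fields == 4   # header records
is_detail <- n_fields == 5   # detail (observation) records
```

This test does not depend on indentation, so it still works if the header lines carry leading blanks.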

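The most straightforward way to read this file and align the header information with the detail records is to use text processing to reshape the file so that each record contains one complete observation. Once the raw data has been reshaped, it can be read easily with read.table().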

The required transformation can be done in base R with a combination of readLines() and lapply().

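This approach also avoids having to keep track of the number of observations per day in order to merge each header record with the correct number of detail records.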
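
The same "carry the header forward" idea can also be written without a loop. The sketch below is an alternative, not the code from this answer: it uses cumsum() to give every line the index of the most recent header, and it identifies headers by their field count. The sample lines come from the question; the column names follow the ones used in this answer.

```r
lines <- c("1980   1   1   1",
           "      6.125 47.375     0.0    20.00     1.0",
           "      6.375 47.375     0.0    19.99     1.0",
           "1980   1   2   2",
           "      6.125 47.375     1.5    20.00     1.0")

# Headers have 4 whitespace-separated fields, detail records have 5.
is_header <- lengths(strsplit(trimws(lines), "\\s+")) == 4

# Running count of headers seen so far = index of the header each line belongs to.
header_id <- cumsum(is_header)
headers   <- lines[is_header]

# Prefix every detail record with its own header.
merged <- paste(headers[header_id[!is_header]], lines[!is_header])

colNames <- c("year", "month", "day", "dayOfYear",
              "long", "lat", "precip", "err1", "err2")
theData <- read.table(text = paste(merged, collapse = "\n"), col.names = colNames)
```

No per-day observation counts are needed here either, because each detail record looks up its header through header_id.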

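Update: improving the performance of the solution

According to the comments on this answer, the script takes a considerable amount of time to run against the full data referenced in the OP. The raw data file has 407,705 rows: 365 header records and 407,340 detail records. The solution above converts the data and loads it into a data frame in about 155 seconds on a MacBook Pro with the following configuration: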

Operating system: OS X Yosemite 10.10.4 (14E46)
Processor: Intel i5 at 2.6 GHz, turbo up to 3.3 GHz, dual core
Memory: 8 gigabytes
Disk: 512 GB, solid-state drive
Date built: April 2013

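Reasons for the slow performance

Compared with the other answers to this post, there are two potential sources of slowness:

The use of the string functions gsub() and strsplit(), one of which produces a list of strings as its output

The use of cat(..., append=TRUE) inside a loop, which means R must open the file, navigate to its end, and add content more than 400,000 times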

Performance optimizations

We adjusted the code in the following ways to improve its performance:

Used the readr library

inFile <- "./data/tempdata1980.txt"
outputFile <- "./data/tempData.txt"
# delete output file if it already exists
if (file.exists(outputFile)) file.remove(outputFile)
theText <- readLines(inFile)
header <- NULL # scope to retain header across executions of lapply()
theResult <- lapply(theText,function(x){
     # reduce blanks to 1 between tokens 
     aRow <- unlist(strsplit(gsub("^ *|(?<= ) | *$", "", x, perl = TRUE)," "))
     # use <<- form of assignment operator to set to parent of if() environment 
     if (length(aRow) == 4) header <<- x
     else {
          cat(paste(header,x),file=outputFile,
              sep="\n",append=TRUE)
     }
})
# now read with read.table
colNames <- c("year","month","day","dayOfYear","long","lat","precip","err1","err2")
theData <- read.table(outputFile,header=FALSE,col.names = colNames)

> head(theData)
  year month day dayOfYear  long    lat precip  err1 err2
1 1980     1   1         1 6.125 47.375    0.0 20.00    1
2 1980     1   1         1 6.375 47.375    0.0 19.99    1
3 1980     1   1         1 6.625 47.375    0.0 19.97    1
4 1980     1   1         1 6.875 47.375    0.0 19.84    1
5 1980     1   1         1 7.125 47.375    0.0 20.00    1
6 1980     1   2         2 6.125 47.375    1.5 20.00    1
>

inFile <- "./data/25_krig_all_1980.txt"
outputFile <- "./data/tempData.txt"
if (file.exists(outputFile)) file.remove(outputFile)
library(readr)
system.time(theText <- readLines(inFile))
#   user  system elapsed 
#  1.821   0.027   1.859 

header <- NULL # scope to retain header across executions of lapply()
outVector <- NULL
i <- 1 
system.time(theResult <- lapply(theText,function(x){
     # reduce blanks to 1 between tokens 
     aRow <- unlist(strsplit(gsub("^ *|(?<= ) | *$", "", x, perl = TRUE)," "))
     # use <<- form of assignment operator to set to parent of if() environment 
     if (length(aRow) == 4) header <<- x
     else {
          outVector[i] <<- paste(header,x)
          i <<- i + 1
     }
}))
#   user  system elapsed 
# 19.327   0.085  19.443 

# write to file
system.time(write_lines(outVector,outputFile))
#   user  system elapsed 
#  0.079   0.020   0.117 

# now read with readr::read_table2()
colNames <- c("year","month","day","dayOfYear","long","lat","precip","err1","err2")
system.time(theData <- read_table2(outputFile,col_names = colNames))
#  user  system elapsed 
# 0.559   0.071   0.794

header <- NULL # scope to retain header across executions of lapply()
outVector <- NULL
i <- 1
system.time(theResult <- lapply(theText,function(x){
     # use <<- form of assignment operator to set to parent of if() environment 
     if (substr(x,1,1) != " ") header <<- x
     else {
          outVector[i] <<- paste(header,x)
          i <<- i + 1
     }
}))
#   user  system elapsed 
#  2.840   0.080   2.933