R: importing data whose columns are fixed character ranges


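I have a file with data that I need to import into a data frame, but the file is laid out very badly.

The file I am trying to import is a list of 344-character strings (32 columns, 445k rows). Each column occupies a specific range of character positions:

Column 1 is characters 1:2

Column 2 is characters 3:6

Column 3 is characters 7:20, and so on.

Example data: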

the.data <- list("32154The street", "12546The clouds", "23236The jungle")

What I have tried:

substr(the.data, 1,2)
substr(the.data, 3,6)
substr(the.data, 7,20)

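and then binding the results together.

I would like to find a better solution.

I also tried inserting commas at the right character positions, exporting to CSV and re-importing (or using textConnection), but ran into problems.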

One option is to use sub to insert a delimiter into the unlisted data and then read the result with read.csv/read.table:

read.csv(text=sub("^(\\d{2})(\\d{3})(.*)", "\\1,\\2,\\3", 
    unlist(the.data)), header = FALSE, 
       col.names = paste0("col", 1:3), stringsAsFactors = FALSE)
#   col1 col2       col3
#1   32  154 The street
#2   12  546 The clouds
#3   23  236 The jungle
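
The regular expression above hard-codes three capture groups, and as far as I know R's replacement strings only support backreferences \\1 to \\9, so it may not stretch to 32 columns. A rough base R sketch that derives the cut points from a vector of widths instead; the widths and column names here are assumptions chosen to fit the sample data, not the real file layout:

```r
# Sample data from the question
the.data <- list("32154The street", "12546The clouds", "23236The jungle")
x <- unlist(the.data)

# Hypothetical layout: named widths, one entry per column
widths <- c(col1 = 2, col2 = 3, col3 = 10)

# Turn widths into inclusive start/end positions
ends   <- cumsum(widths)        # 2, 5, 15
starts <- ends - widths + 1     # 1, 3, 6

# One substr() call per column; Map() keeps the names of `starts`
cols <- Map(function(s, e) substr(x, s, e), starts, ends)

df <- as.data.frame(cols, stringsAsFactors = FALSE)

# Convert columns that look numeric; this drops leading zeros,
# so skip it for ID-like fields
df <- type.convert(df, as.is = TRUE)
print(df)
```

For the real file, `x` would come from `readLines()` and `widths` would have 32 entries.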

Or we can split by position with separate from tidyr:

library(dplyr)
library(tidyr)
unlist(the.data) %>%
      as_tibble %>%
      separate(value, into = paste0("col", 1:3), sep = c(2, 5))
# A tibble: 3 x 3
#   col1  col2  col3      
#* <chr> <chr> <chr>     
#1 32    154   The street
#2 12    546   The clouds
#3 23    236   The jungle
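
A variant of the same idea, assuming tidyr 1.3.0 or later is installed: `separate_wider_position()` takes field widths rather than cut positions, which can be easier to keep in sync with a layout specification. The column names and the width of 10 for the last field are assumptions that fit the sample data only.

```r
library(tidyr)
library(tibble)

# Sample data from the question
the.data <- list("32154The street", "12546The clouds", "23236The jungle")

# Named widths: each name becomes a column, each value is a field width.
# The widths must add up to the length of every string (15 here).
df <- separate_wider_position(
  tibble(value = unlist(the.data)),
  value,
  widths = c(col1 = 2, col2 = 3, col3 = 10)
)
print(df)
```

All resulting columns are character; convert them afterwards if numeric types are needed.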

Something like this:
> library(stringr)
> data.frame(col1=str_sub(the.data,1,2),col2=str_sub(the.data,3,5),col3=str_sub(the.data,6,-1))
  col1 col2       col3
1   32  154 The street
2   12  546 The clouds
3   23  236 The jungle

readr, part of the tidyverse, can read fixed-width data:
library('tidyverse')

read_fwf(paste(the.data, collapse='\n'), fwf_widths(c(2,3,15)))
#> # A tibble: 3 x 3
#>      X1    X2         X3
#>   <int> <int>      <chr>
#> 1    32   154 The street
#> 2    12   546 The clouds
#> 3    23   236 The jungle
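
Since the real data lives in a file, the same call can point at a path. A minimal sketch, using a temporary file as a stand-in for the real one and `fwf_positions()` so the layout is written as start/end character ranges like in the question; the column names and types are assumptions:

```r
library(readr)

# Sample data from the question, written to a temporary file
the.data <- list("32154The street", "12546The clouds", "23236The jungle")
path <- tempfile(fileext = ".txt")
writeLines(unlist(the.data), path)

# Inclusive start/end positions per column (hypothetical names)
spec <- fwf_positions(
  start     = c(1, 3, 6),
  end       = c(2, 5, NA),  # NA: last field runs to the end of the line
  col_names = c("code", "number", "label")
)

# "iic" = integer, integer, character
df <- read_fwf(path, col_positions = spec, col_types = "iic")
print(df)
```

Declaring `col_types` up front avoids type guessing, which may matter for a file with 445k rows and ID-like columns.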

This is nice. I was going to suggest read.fwf(ff, widths = c(3, 2, 15)).

Nice solution, I can read it directly from the file. My only remaining problem is the encoding.
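
On the encoding problem: the thread does not say which encoding the file uses, so the following is only a sketch that assumes Latin-1. It writes a small Latin-1 file byte by byte (so the example does not depend on the session locale) and reads it with readr's `locale()` argument:

```r
library(readr)

# Build a two-line Latin-1 file; 0xE9 is "é" in Latin-1 (assumed encoding)
path <- tempfile(fileext = ".txt")
line1 <- c(charToRaw("32154Caf"), as.raw(0xE9), charToRaw(" street"), as.raw(0x0A))
line2 <- c(charToRaw("12546The clouds"), as.raw(0x0A))
writeBin(c(line1, line2), path)

# Tell readr the source encoding; strings are converted to UTF-8 on import
df <- read_fwf(
  path,
  col_positions = fwf_widths(c(2, 3, NA), col_names = c("col1", "col2", "col3")),
  col_types = "iic",
  locale = locale(encoding = "latin1")
)
print(df)
```

If the encoding is unknown, `readr::guess_encoding()` on the file may give a starting point, although its result is only a guess.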