Fast manipulation of character arrays in R data.table

Tags: arrays, r, data.table, readr


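I have a huge dataset of character vectors (14 GB, 200 million rows). I have already read it in with fread, which took more than 30 minutes on a 48-core, 128 GB server. Each string contains the concatenated values of several fields. For example, the first row of my table looks like this:

2014120900000001091500bbbbcompany_name00032401

The first 8 characters are the date in YYYYMMDD format, the next 8 are the id, the next 6 are the time in HHMMSS format, the next 16 are the name (prefixed with b's), and the last 8 are the price (with 2 decimal places).

I need to convert this one-column data.table into 5 columns:
date, id, time, name, price

For the character vector above, the result would be:
date = "2014-12-09", id = 1, time = "09:15:00", name = "company_name", price = 324.01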
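To make the target concrete, here is a minimal base R sketch that splits the single sample row into the five fields and converts each one. The boundary vectors first and last, the regular expressions, and the assumption of exactly four padding b's are illustrative choices for this one row, not part of the question.

```r
# Sample row from the question
x <- "2014120900000001091500bbbbcompany_name00032401"

# Field boundaries for the layout described above (hard-coded for this sketch)
first <- c(1, 9, 17, 23, 39)
last  <- c(8, 16, 22, 38, 46)

# substring() is vectorised over first/last, so one call returns all five pieces
parts <- substring(x, first, last)
names(parts) <- c("date", "id", "time", "name", "price")

parsed <- list(
  date  = as.Date(parts[["date"]], format = "%Y%m%d"),
  id    = as.integer(parts[["id"]]),
  time  = sub("(..)(..)(..)", "\\1:\\2:\\3", parts[["time"]]),
  name  = sub("^bbbb", "", parts[["name"]]),  # assumes exactly four padding b's
  price = as.numeric(parts[["price"]]) / 100  # last two digits are decimals
)
str(parsed)
```

This only illustrates the expected result for one string; it says nothing about speed on 200 million rows.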

I am looking for a (very) fast and efficient dplyr/data.table solution. Right now I am using substr:

date = as.Date(substr(d, 1, 8), "%Y%m%d");

and it takes forever to execute.

Update: with readr::read_fwf I can read the file in 5-10 minutes. Apparently reading it this way is faster than fread.
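A readr::read_fwf call for this layout might look roughly like the sketch below. This is not the asker's actual code: the temporary file, the second sample row, the column names, and the post-processing are all assumptions made for illustration.

```r
library(readr)

# Hypothetical stand-in for the real 14 GB file: two rows in the question's layout
path <- tempfile(fileext = ".txt")
writeLines(c("2014120900000001091500bbbbcompany_name00032401",
             "2014120900000002091501bbbbcompany_name00032450"), path)

# Read every field as character first, then convert column by column
raw <- read_fwf(
  path,
  col_positions = fwf_widths(c(8, 8, 6, 16, 8),
                             col_names = c("date", "id", "time", "name", "price")),
  col_types = "ccccc"
)

raw$date  <- as.Date(raw$date, format = "%Y%m%d")
raw$id    <- as.integer(raw$id)
raw$name  <- sub("^bbbb", "", raw$name)    # assumes exactly four padding b's
raw$price <- as.numeric(raw$price) / 100   # last two digits are decimals
raw
```

Reading everything as character and converting afterwards keeps the leading zeros under explicit control; whether that is the fastest option for a file of this size is untested here.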


Maybe try using a numeric matrix instead of a data.frame. Aggregation should then take less time.

A possible solution:

library(data.table)
library(stringi)

widths <- c(8,8,6,16,8)
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
ep <- cumsum(widths)

DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
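To see what sp and ep contain, the position arithmetic can be checked on its own. In this sketch base R's substring() stands in for stri_sub, and the field labels are taken from the question; both are choices made for illustration only.

```r
widths <- c(8, 8, 6, 16, 8)

# Start positions: 1, then one past the cumulative end of each preceding field
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
# End positions: cumulative sum of the widths
ep <- cumsum(widths)

layout <- data.frame(field = c("date", "id", "time", "name", "price"),
                     start = sp, end = ep)
print(layout)
# start: 1 9 17 23 39
# end:   8 16 22 38 46

# Apply the positions to the sample row (substring() used here instead of stri_sub)
x <- "2014120900000001091500bbbbcompany_name00032401"
pieces <- substring(x, sp, ep)
print(pieces)
```

The computed positions match the hard-coded substr ranges (1-8, 9-16, 17-22, 23-38, 39-46) used elsewhere in this thread.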
Including some additional processing to get the desired result:

DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))
   ][, .(date = as.Date(V1, "%Y%m%d"),
         id = as.integer(V2),
         time = as.ITime(V3, "%H%M%S"),
         name = sub("^(bbbb)","",V4),
         price = as.numeric(V5)/100)]
But you are actually reading a fixed-width file, so you could also consider read.fwf from base R, read_fwf from readr, or writing your own fread.fwf function, like the one I wrote a while ago:

fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)
  ep <- cumsum(widths)
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}
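A small usage sketch for fread.fwf follows. The function is repeated so the example runs on its own; the temporary file, the second sample row, the column names, and the type conversions are assumptions added for illustration.

```r
library(data.table)
library(stringi)

# fread.fwf as defined above, repeated here so the example is self-contained
fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)
  ep <- cumsum(widths)
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[
    , lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}

# Hypothetical two-row file in the question's layout
path <- tempfile(fileext = ".txt")
writeLines(c("2014120900000001091500bbbbcompany_name00032401",
             "2014120900000002091501bbbbcompany_name00032450"), path)

res <- fread.fwf(path, widths = c(8, 8, 6, 16, 8))
setnames(res, c("date", "id", "time", "name", "price"))

# Convert the character columns by reference
res[, `:=`(date  = as.Date(date, "%Y%m%d"),
           id    = as.integer(id),
           name  = sub("^bbbb", "", name),   # assumes exactly four padding b's
           price = as.numeric(price) / 100)]
print(res)
```

Converting with := updates the columns in place instead of building a second table, which may matter at 200 million rows.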

Maybe your solution is not so bad after all.

I am using this data:

df <- data.table(text = rep("2014120900000001091500bbbbcompany_name00032401", 100000))

The substr approach from the question:

> system.time(df[, .(date = as.Date(substr(text, 1, 8), "%Y%m%d"),
+                    id = as.integer(substr(text, 9, 16)),
+                    time = substr(text, 17, 22),
+                    name = substr(text, 23, 38),
+                    price = as.numeric(substr(text, 39, 46))/100)])
   user  system elapsed 
   0.17    0.00    0.17 

@Jaap's solution:

> library(data.table)
> library(stringi)
> 
> widths <- c(8,8,6,16,8)
> sp <- c(1, cumsum(widths[-length(widths)]) + 1)
> ep <- cumsum(widths)
> 
> system.time(df[, lapply(seq_along(sp), function(i) stri_sub(text, sp[i], ep[i]))
+    ][, .(date = as.Date(V1, "%Y%m%d"),
+          id = as.integer(V2),
+          time = V3,
+          name = sub("^(bbbb)","",V4),
+          price = as.numeric(V5)/100)])
   user  system elapsed 
   0.20    0.00    0.21 

Trying read.fwf:

> setClass("myDate")
> setAs("character","myDate", function(from) as.Date(from, format = "%Y%m%d"))
> setClass("myNumeric")
> setAs("character","myNumeric", function(from) as.numeric(from)/100)
> 
> ff <- function(x) {
+   file <- textConnection(x)
+   read.fwf(file, c(8, 8, 6, 16, 8),
+            col.names = c("date", "id", "time", "name", "price"),
+            colClasses = c("myDate", "integer", "character", "character", "myNumeric"))
+ }
> 
> system.time(df[, as.list(ff(text))])
   user  system elapsed 
   2.33    6.15    8.49 

All the outputs are the same.

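Have you tried read.fwf to read the fixed substrings directly into separate columns?

@Henrik Thanks for pointing that out. I did not know about fixed-width reading, which is exactly what I was after.

Thanks @Jaap for the solution. I have tried readr::read_fwf, which gave me satisfactory performance. I have not tried your solution yet.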