Is there a better way to process JSON in R | Performance improvement

json r

Data preparation

 comp <- 
 c('[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]', 
 '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]')
 id = c(1,2)
 jsonData = as.data.frame(id,comp)
 jsonData
                                                                   id
[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]     1
[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]  2
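The printed result above has the JSON strings as row names and only an `id` column, because the second positional argument of `as.data.frame()` is `row.names`. A minimal sketch of building the intended two-column data frame in base R follows; the name `jsonData` is taken from the question, while `stringsAsFactors = FALSE` is an addition of mine to keep `comp` as character on older R versions:

```r
comp <- c(
  '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]',
  '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
)
id <- c(1, 2)

# data.frame() takes each named vector as a column
jsonData <- data.frame(id = id, comp = comp, stringsAsFactors = FALSE)

# inspect the structure: two columns, id (numeric) and comp (character)
str(jsonData)
```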

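The `comp` JSON is text. Text parsing is slow. Also, I'm not sure why `library(dplyr)` is there, since it comes with tidyverse. And you should consider reading up on how to create data frames.
Regardless, we'll make a representative example with 500,000 rows: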
library(tidyverse)

# tibble() replaces data_frame(), which is deprecated in current tibble releases
tibble(
  id = rep(c(1L, 2L), 250000),
  comp = rep(c(
    '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]', 
    '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
  ), 250000)
) -> xdf

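There are many JSON processing packages in R. Test out a few. This uses `ndjson`, which has a `flatten()` function that takes a character vector of JSON strings and makes a "completely flat" structure from it.
I'm only using different data frame variables for explanatory clarity and for the benchmarking later.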
pull(xdf, comp) %>% 
  ndjson::flatten() %>% 
  bind_cols(select(xdf, id)) -> ydf

That produces:
ydf
## Source: local data table [500,000 x 5]
## 
## # A tibble: 500,000 x 5
##    `0.id` `0.name`  `1.id` `1.name`     id
##     <dbl> <chr>      <dbl> <chr>     <int>
##  1    28. Google       12. Microsoft     1
##  2    32. Microsoft   878. Facebook      2
##  3    28. Google       12. Microsoft     1
##  4    32. Microsoft   878. Facebook      2
##  5    28. Google       12. Microsoft     1
##  6    32. Microsoft   878. Facebook      2
##  7    28. Google       12. Microsoft     1
##  8    32. Microsoft   878. Facebook      2
##  9    28. Google       12. Microsoft     1
## 10    32. Microsoft   878. Facebook      2
## # ... with 499,990 more rows
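To see what "completely flat" means on a small input, the sketch below runs `ndjson::flatten()` on just the two distinct JSON strings. It assumes the `ndjson` package is installed; `small` and `flat` are illustrative names, and the exact class of the returned object may differ between `ndjson` versions.

```r
library(ndjson)

small <- c(
  '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]',
  '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
)

# one row per input string; each array element's fields become
# "<array index>.<field>" columns, as in the output shown above
flat <- ndjson::flatten(small)
names(flat)
```

Because every array position gets its own pair of columns, rows whose arrays differ in length would leave some of those columns as `NA`.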

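Therefore, going by the 1,000-row benchmark at the end of this answer:

roughly 15 ms for 1,000 rows
for 500,000 rows, 15 ms * 500 = 7.5 s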
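The extrapolation is simple arithmetic and assumes run time grows linearly with the number of rows, which the benchmark itself does not verify. A small sketch with illustrative variable names:

```r
ms_per_1000_rows <- 15      # rounded mean from the 1,000-row microbenchmark
target_rows <- 500000

# linear scaling assumption: 500 batches of 1,000 rows, converted from ms to s
est_seconds <- ms_per_1000_rows * (target_rows / 1000) / 1000
est_seconds
```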

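If you're not pedantic about the `id1` column needing to be an integer, you can likely shave off a few milliseconds.

There are other approaches. And if you regularly work with columns of JSON data, I highly recommend checking out Apache Drill and the `sergeant` package.

Comments

Convert JSON to normal columns? "Performance hit"?

I mean converting the JSON values, i.e. '{"id": 28, "name": "Google"}', into normal columns of a data table so that we can apply dplyr queries; otherwise dplyr treats the JSON as plain text. When I process more than 500,000 JSON values, the way I convert them takes several minutes; that is the performance hit.

Thanks, this will help me a lot, and it also provides a data point for research.
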
The flattened result can be turned back into a tidier, long-format data frame:
bind_rows(
  select(ydf, id = id, id1=`0.id`, name=`0.name`),
  select(ydf, id = id, id1=`1.id`, name=`1.name`)
) %>% 
  mutate(id1 = as.integer(id1))
## Source: local data table [1,000,000 x 3]
## 
## # A tibble: 1,000,000 x 3
##       id   id1 name     
##    <int> <int> <chr>    
##  1     1    28 Google   
##  2     2    32 Microsoft
##  3     1    28 Google   
##  4     2    32 Microsoft
##  5     1    28 Google   
##  6     2    32 Microsoft
##  7     1    28 Google   
##  8     2    32 Microsoft
##  9     1    28 Google   
## 10     2    32 Microsoft
## # ... with 999,990 more rows

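Benchmark of the whole pipeline on a 1,000-row version of the data:
# tibble() replaces data_frame(), which is deprecated in current tibble releases
tibble(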
  id = rep(c(1L, 2L), 500),
  comp = rep(c(
    '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]', 
    '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
  ), 500)
) -> xdf

microbenchmark::microbenchmark(
  faster = {
    pull(xdf, comp) %>% 
      ndjson::flatten() %>% 
      bind_cols(select(xdf, id)) -> ydf

    bind_rows(
      select(ydf, id = id, id1=`0.id`, name=`0.name`),
      select(ydf, id = id, id1=`1.id`, name=`1.name`)
    ) %>% 
      mutate(id1 = as.integer(id1))
  }
)
## Unit: milliseconds
##    expr      min       lq     mean   median       uq      max neval
##  faster 12.46409 13.71483 14.73997 14.40582 15.47529 21.09543   100
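
As one example of the "other approaches" mentioned, here is a minimal sketch using `jsonlite` instead of `ndjson`. It assumes `jsonlite` is installed; `parsed`, `pieces`, and `tidy` are illustrative names. It has not been benchmarked here, and parsing row by row may well be slower than the `ndjson` pipeline, so treat it as a comparison candidate rather than a recommendation.

```r
library(jsonlite)

xdf <- data.frame(
  id = c(1L, 2L),
  comp = c(
    '[{"id": 28, "name": "Google"}, {"id": 12, "name": "Microsoft"}]',
    '[{"id": 32, "name": "Microsoft"}, {"id": 878, "name": "Facebook"}]'
  ),
  stringsAsFactors = FALSE
)

# parse each JSON array into a small data frame with columns id and name
parsed <- lapply(xdf$comp, jsonlite::fromJSON)

# attach the outer id to each parsed piece, then stack the pieces
pieces <- Map(
  function(outer_id, d) {
    data.frame(id = outer_id, id1 = d$id, name = d$name,
               stringsAsFactors = FALSE)
  },
  xdf$id, parsed
)
tidy <- do.call(rbind, pieces)
tidy
```

Unlike the `bind_rows()` version, which stacks all first array elements before all second ones, this keeps the elements of each original row next to each other, and it does not depend on every array having the same length.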