sparklyr::sdf_quantile() error


r apache-spark


I know Spark 1.6.0 is probably outdated, but it is what we have. I am trying to use

sparklyr::sdf_quantile()

and it fails with an error. This is the sessionInfo() of that machine:

sessionInfo()
Oracle Distribution of R version 3.3.0  (--)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Oracle Linux Server 7.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kudusparklyr_0.1.0  sparklyr_0.7.0      dbplot_0.2.0        rlang_0.1.4        
 [5] bindrcpp_0.2        anytime_0.3.0       jsonlite_1.5        magrittr_1.5       
 [9] ggplot2_2.2.1       DBI_0.7             dtplyr_0.0.2        dplyr_0.7.4        
[13] data.table_1.10.4-3 devtools_1.13.4     httr_1.3.1         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14       dbplyr_1.1.0       plyr_1.8.4         bindr_0.1         
 [5] base64enc_0.1-3    tools_3.3.0        digest_0.6.12      lattice_0.20-33   
 [9] nlme_3.1-127       memoise_1.1.0      tibble_1.3.4       gtable_0.2.0      
[13] pkgconfig_2.0.1    psych_1.7.8        shiny_1.0.5        rstudioapi_0.7    
[17] yaml_2.1.15        parallel_3.3.0     stringr_1.2.0      withr_2.1.0       
[21] rprojroot_1.2      grid_3.3.0         glue_1.2.0         R6_2.2.2          
[25] foreign_0.8-66     reshape2_1.4.2     purrr_0.2.4        tidyr_0.7.2       
[29] scales_0.5.0       backports_1.1.1    htmltools_0.3.6    mnormt_1.5-5      
[33] assertthat_0.2.0   xtable_1.8-2       mime_0.5           RApiDatetime_0.0.3
[37] colorspace_1.3-2   httpuv_1.3.5       labeling_0.3       config_0.2        
[41] stringi_1.1.6      openssl_0.9.9      lazyeval_0.2.1     munsell_0.4.3     
[45] broom_0.4.3

On another machine (using Spark 2.2.0 locally) it works:

mtc %>% sdf_quantile("hp")
  0%  25%  50%  75% 100% 
  52   95  123  180  335

with the following sessionInfo:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252   
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rsparkling_0.2.2    leaflet_1.1.0       dplyr_0.7.4         purrr_0.2.4        
 [5] readr_1.1.1         tidyr_0.6.1         tibble_1.4.1        ggplot2_2.2.1      
 [9] tidyverse_1.1.1     sparklyr_0.7.0-9030

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     lubridate_1.6.0  lattice_0.20-35  assertthat_0.2.0 rprojroot_1.2   
 [6] digest_0.6.12    psych_1.7.3.21   mime_0.5         R6_2.2.2         cellranger_1.1.0
[11] plyr_1.8.4       backports_1.0.5  evaluate_0.10    httr_1.2.1       pillar_1.0.1    
[16] rlang_0.1.6      lazyeval_0.2.0   readxl_1.0.0     rstudioapi_0.7   rmarkdown_1.6   
[21] config_0.2       stringr_1.2.0    foreign_0.8-69   htmlwidgets_0.8  RCurl_1.95-4.8  
[26] munsell_0.4.3    shiny_1.0.5      broom_0.4.2      compiler_3.4.1   httpuv_1.3.5    
[31] modelr_0.1.0     pkgconfig_2.0.1  base64enc_0.1-3  mnormt_1.5-5     htmltools_0.3.5 
[36] openssl_0.9.7    withr_2.0.0      dbplyr_1.2.0     rappdirs_0.3.1   bitops_1.0-6    
[41] grid_3.4.1       nlme_3.1-131     jsonlite_1.5     xtable_1.8-2     gtable_0.2.0    
[46] DBI_0.7          magrittr_1.5     scales_0.4.1     stringi_1.1.3    reshape2_1.4.2  
[51] bindrcpp_0.2     xml2_1.1.1       tools_3.4.1      forcats_0.2.0    glue_1.2.0      
[56] hms_0.3          crosstalk_1.0.0  parallel_3.4.1   yaml_2.1.14      colorspace_1.3-2
[61] h2o_3.14.0.2     rvest_0.3.2      knitr_1.15.1     bindr_0.1        haven_1.0.0

What is the problem?

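Approximate quantiles were introduced in Spark 2.0, and sdf_quantile() relies on them. You have to update your Apache Spark installation to use it.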

If Hive support is enabled, you can try the percentile_approx Hive function:

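library(sparklyr)
library(dplyr)

# sc is an existing Spark connection; copy_to() registers iris as table "iris"
df <- copy_to(sc, iris)

# Run the Hive aggregate through Spark SQL and register the result as "median"
sc %>% spark_session() %>%
  invoke("sql", "SELECT percentile_approx(Sepal_Length, 0.5) FROM iris") %>% 
  sdf_register("median")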

# # Source:   table<median> [?? x 1]
# # Database: spark_connection
#   `_c0`
#   <dbl>
# 1  5.73
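
If the same script has to run against both the old and the new cluster, the two approaches above could be combined behind a version check. The following is only a sketch: supports_sdf_quantile() and approx_median() are hypothetical helper names, not part of sparklyr, and the fallback branch assumes that Hive support is enabled and that the table is registered under table_name.

```r
# Hypothetical helper (not part of sparklyr): TRUE when the given Spark
# version is at least 2.0.0, where approximate quantiles became available.
supports_sdf_quantile <- function(version) {
  numeric_version(as.character(version)) >= numeric_version("2.0.0")
}

# Hypothetical wrapper: approximate median of one column.
# tbl is a Spark DataFrame reference, table_name the name it is registered under.
approx_median <- function(sc, tbl, table_name, column) {
  if (supports_sdf_quantile(sparklyr::spark_version(sc))) {
    # Spark >= 2.0: sdf_quantile() returns a named numeric vector
    sparklyr::sdf_quantile(tbl, column, probabilities = 0.5)
  } else {
    # Older Spark, assuming Hive support: returns a one-row Spark table
    query <- sprintf("SELECT percentile_approx(%s, 0.5) FROM %s",
                     column, table_name)
    sparklyr::sdf_register(
      sparklyr::invoke(sparklyr::spark_session(sc), "sql", query),
      "median"
    )
  }
}
```

With the objects from the answer, the call would be approx_median(sc, df, "iris", "Sepal_Length"). The two branches return different kinds of objects (a numeric vector versus a Spark table), so the caller still has to handle both.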
