Why does the same query using dplyr return different results on different R sessions? [r, dplyr, tidyverse, rstudio-server]

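While working on a project with a colleague that involves manipulating data frames with dplyr, from the tidyverse, I noticed that some of our results differed even though we were using the same code and the same data.

Session info from the two R sessions:

Desktop: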

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 
[2] LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [4] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
 [7] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0
[10] sp_1.3-2


RStudio Cloud:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomNames_1.4-0.0  plotly_4.9.2.1       lubridate_1.7.9     
 [4] openintro_2.0.0      usdata_0.1.0         cherryblossom_0.1.0 
 [7] airports_0.1.0       leaflet_2.0.3        forcats_0.5.0       
[10] stringr_1.4.0        dplyr_1.0.0          purrr_0.3.4         
[13] readr_1.3.1          tidyr_1.1.0          tibble_3.0.2        
[16] ggplot2_3.3.2        tidyverse_1.3.0      shinydashboard_0.7.1
[19] shiny_1.5.0

Reproducible example using iris:

library(tidyverse)

# Let's say that each flower in the iris data frame had a name


iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")
  

# and that, for some reason, the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)

iris_big <- rbind(iris,iris[sample_index,])
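
The query step itself is not shown in this copy of the post. Judging from the "Using `n` as weighting variable" message and the count(Species, wt = n()) workaround in the answer further down, it presumably had the shape sketched below. This is an inference run on a hypothetical toy data frame (`flowers` is not from the post), not the poster's verbatim code:

```r
library(dplyr)

# Hypothetical stand-in for iris_big: some flowers appear more than once.
flowers <- data.frame(
  name    = c("Jackson", "Jackson", "Chen", "Smith", "Smith", "Smith"),
  Species = c("setosa", "setosa", "setosa",
              "virginica", "virginica", "virginica")
)

result <- flowers %>%
  group_by(name, Species) %>%
  count() %>%        # one row per flower, plus a column `n` of repeat counts
  ungroup() %>%
  count(Species)     # this second count() now sees a column named `n`
```

If the second count() counts rows, `result` holds the number of distinct flowers per species (2 and 1 for this toy data); if it weights by `n`, it holds the number of original rows per species (3 and 3).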


The problem is that it returns two different results: one on my desktop and another on my friend's machine (he is using RStudio Cloud).

My desktop:

# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

RStudio Cloud:

Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        83
2 versicolor    80
3 virginica     87

I ended up working around the problem with the following query:

iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>%
  select(Species) %>% 
  group_by(Species) %>% 
  count()

# A tibble: 3 x 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50


But I would like to know why this is happening.

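You are using sample, which draws from a discrete uniform distribution.

In R (and related discussion), the problem of non-uniform sampling was discussed and fixed. The change took effect in R-3.6.

This is easy to demonstrate: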

R-3.5.3, 64-bit (win10)

set.seed(123); sample(5)
# [1] 2 4 5 3 1

R-3.6.1, 64-bit (win10)

set.seed(123); sample(5)
# [1] 3 2 5 4 1

R-4.0.2, 64-bit (win10)

set.seed(123); sample(5)
# [1] 3 2 5 4 1

In R-3.6 and newer, you can return to the pre-3.6 sampling with:

RNGkind(sample.kind = "Rounding")
# Warning in RNGkind(sample.kind = "Rounding") :
#   non-uniform 'Rounding' sampler used
set.seed(123); sample(5)
# [1] 2 4 5 3 1
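
Switching the sampler this way changes it for the rest of the session, so it may be worth restoring the previous setting once the legacy draw has been made. A minimal sketch, assuming R 3.6.0 or newer (where RNGkind() reports a third element, sample.kind); the variable names are illustrative:

```r
# Remember the current settings: c(kind, normal.kind, sample.kind).
old_kinds <- RNGkind()

# Temporarily use the pre-3.6 sampler. The warning about the non-uniform
# 'Rounding' sampler is expected, so it is silenced here.
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(123)
legacy_draw <- sample(5)

# Put the original sampler back so later code is unaffected.
RNGkind(sample.kind = old_kinds[3])
```

After the last line, RNGkind() reports the same settings as before the switch, while `legacy_draw` keeps the permutation produced by the old sampler.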

I don't think you are getting what you think you are getting. Consider:

> unique(iris_big$Species)
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica
> sum(iris_big$Species == 'setosa')
[1] 83
> sum(iris_big$Species == 'versicolor')
[1] 80

What are you trying to reduce this to?

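(First, I am submitting this as an alternative answer because my other answer, about the change to sample.int between R-3.5 and R-3.6, still seems relevant to the question of "why does the same query return different results on different R sessions". It is not what causes this symptom, but it easily could have been, since the first version of your question used sample. The real culprit is instead an equally "major" version change in dplyr.)

You are experiencing a breaking change in the behavior of dplyr::count.

In dplyr-0.8.3, ?count says: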

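      wt: (Optional) If omitted (and no variable named 'n' exists in the
          data), will count the number of rows. If specified, will
          perform a "weighted" tally by summing the (non-missing)
          values of variable 'wt'. A column named 'n' (but not 'nn' or
          'nnn') will be used as weighting variable by default in
          'tally()', but not in 'count()'. This argument is
          automatically quoted and later evaluated in the context of
          the data frame. It supports unquoting. See
          'vignette("programming")' for an introduction to these
          concepts.

In dplyr-1.0.0:

      wt: Frequency weights. Can be a variable (or combination of
          variables) or 'NULL'. 'wt' is computed once for each unique
          combination of the counted variables.

            • If a variable, 'count()' will compute 'sum(wt)' for each
              unique combination.

            • If 'NULL', the default, the computation depends on
              whether a column of frequency counts 'n' exists in the
              data frame. If it exists, the counts are computed with
              'sum(n)' for each unique combination. Otherwise, 'n()'
              is used to compute the counts. Supply 'wt = n()' to
              force this behaviour even if you have an 'n' column in
              the data frame.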

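The important part is that the 0.8.3 text says a column named 'n' "will be used as weighting variable by default in 'tally()', but not in 'count()'", whereas the 1.0.0 text no longer includes that wording. I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.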
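
Because the behaviour hinges on the installed dplyr release, a quick check on each machine can confirm which rule applies. A small sketch, assuming (as this answer describes) that 1.0.0 is the release in which count() weights by an existing `n` column; the variable names are illustrative:

```r
# Report the installed dplyr version and flag the release discussed above.
dplyr_version <- packageVersion("dplyr")

# Assumption from the answer: only 1.0.0 weights by an existing `n` column.
weights_by_n <- dplyr_version == "1.0.0"

message("dplyr ", as.character(dplyr_version),
        "; count() weights by an existing `n` column: ", weights_by_n)
```

Running this on both the desktop and RStudio Cloud sessions should show the same split as the two sessionInfo() listings in the question (0.8.3 versus 1.0.0).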

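There are two workarounds:

Use count(..., wt = n()):

R.version$version.string
# [1] "R version 3.5.3 (2019-03-11)"
iris_big %>%
  group_by(name, Species) %>%
  count() %>%
  ungroup() %>%
  count(Species, wt = n())
# # A tibble: 3 x 2
#   Species        n
#   <fct>      <int>
# 1 setosa        50
# 2 versicolor    50
# 3 virginica     50

R.version$version.string
# [1] "R version 4.0.2 (2020-06-22)"
iris_big %>%
  group_by(name, Species) %>%
  count() %>%
  ungroup() %>%
  count(Species, wt = n())
# # A tibble: 3 x 2
#   Species        n
#   <fct>      <int>
# 1 setosa        50
# 2 versicolor    50
# 3 virginica     50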

Switch to using tally in your grouping, as in:

iris_big %>%
  group_by(name, Species) %>%
  count() %>%
  group_by(Species) %>%
  tally()

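Alternatively, you have another option:

This has been recognized as an issue, and it is fixed in the not-yet-released dplyr-1.0.1 (I do not know the timeline). RStudio Cloud users can therefore install the GitHub version of dplyr to benefit from the already-merged PR. That should return the behavior of count to what it was before 1.0.0 (this issue notwithstanding).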
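
Another route, offered here as an assumption rather than something proposed in the thread, is to avoid creating an intermediate `n` column at all: distinct() keeps one row per (name, Species) pair, so the following count() has nothing it could treat as a weight, whichever dplyr release is installed. A sketch on a hypothetical toy data frame:

```r
library(dplyr)

# Hypothetical toy data: "a" appears twice and "b" three times.
flowers <- data.frame(
  name    = c("a", "a", "b", "b", "b", "c"),
  Species = c("setosa", "setosa", "setosa", "setosa", "setosa", "virginica")
)

# distinct() drops repeated (name, Species) pairs and adds no `n` column,
# so count() simply counts the remaining rows per species.
per_species <- flowers %>%
  distinct(name, Species) %>%
  count(Species)
# per_species$n is 2 for setosa and 1 for virginica
```

Applied to the question's data, the same two steps would be iris_big %>% distinct(name, Species) %>% count(Species).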

Have you tried using

iris_big %>% gro