Create a new variable in a data frame based on a condition in one column, extracting from another column? (dplyr)

I have the following data frame:

    df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", 
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana", 
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE, 
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper", 
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds", 
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
    8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L, 
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
    country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia", 
    "Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country", 
"CommodRank", "topCommodInCountry", "Main_Commod"))

df

            country CommodRank topCommodInCountry Main_Commod
1             Ghana          1               TRUE        Gold
2           Eritrea          2              FALSE      Copper
3          Ethiopia          3              FALSE      Nickel
4          Ethiopia          1               TRUE        Gold
5  Congo - Kinshasa          3              FALSE        Gold
6          Ethiopia          1               TRUE        Gold
7          Ethiopia          1               TRUE        Gold
8             Ghana          1               TRUE        Gold
9          Botswana          1               TRUE    Diamonds
10          Nigeria          1               TRUE    Iron Ore  

Ideally I am looking for a dplyr solution that I can add to an existing long chain of piped %>% function calls, but any solution would help.
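For anyone reproducing this: the dput() output above stores the grouping in attributes (vars, indices, labels) written by an older dplyr release, and newer versions may not accept that structure as a valid grouped data frame. A minimal sketch that rebuilds the same ten rows from plain vectors and regroups by country, assuming a current dplyr is installed:

```r
library(dplyr)

# Same ten rows as the dput() output above, rebuilt from plain vectors
df <- data.frame(
  country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", "Congo - Kinshasa",
              "Ethiopia", "Ethiopia", "Ghana", "Botswana", "Nigeria"),
  CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 1L),
  topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE,
                         TRUE, TRUE, TRUE, TRUE, TRUE),
  Main_Commod = c("Gold", "Copper", "Nickel", "Gold", "Gold",
                  "Gold", "Gold", "Gold", "Diamonds", "Iron Ore"),
  stringsAsFactors = FALSE
) %>%
  group_by(country)  # the per-country logic depends on this grouping
```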

IIUC, there are several ways to do this, for example:

df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
                       else Main_Commod[which.max(topCommodInCountry)])

# A tibble: 10 x 5
# Groups:   country [6]
   country          CommodRank topCommodInCountry Main_Commod topCom  
   <chr>                 <int> <lgl>              <chr>       <chr>   
 1 Ghana                     1 TRUE               Gold        Gold    
 2 Eritrea                   2 FALSE              Copper      unknown 
 3 Ethiopia                  3 FALSE              Nickel      Gold    
 4 Ethiopia                  1 TRUE               Gold        Gold    
 5 Congo - Kinshasa          3 FALSE              Gold        unknown 
 6 Ethiopia                  1 TRUE               Gold        Gold    
 7 Ethiopia                  1 TRUE               Gold        Gold    
 8 Ghana                     1 TRUE               Gold        Gold    
 9 Botswana                  1 TRUE               Diamonds    Diamonds
10 Nigeria                   1 TRUE               Iron Ore    Iron Ore
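Why the any() guard is needed can be seen in base R alone. The sketch below uses a hypothetical helper, top_or_unknown(), that is not part of the answer; it only mirrors the if/else expression used inside mutate() for a single group.

```r
# which.max() on a logical vector returns the position of the first TRUE
flags  <- c(FALSE, TRUE, FALSE, TRUE)
commod <- c("Nickel", "Gold", "Copper", "Gold")
first_true <- which.max(flags)      # 2
commod[first_true]                  # "Gold"

# With no TRUE at all, which.max() still returns 1 (the first maximum),
# so without the any() check the first row's value would be picked silently
no_true <- which.max(c(FALSE, FALSE))  # 1

# Hypothetical helper mirroring the answer's expression for one group
top_or_unknown <- function(flag, value) {
  if (any(flag)) value[which.max(flag)] else "unknown"
}

top_or_unknown(flags, commod)                        # "Gold"
top_or_unknown(c(FALSE, FALSE), c("Copper", "Tin"))  # "unknown"
```

Because the data frame is grouped by country, mutate() evaluates the expression once per group and recycles the single result across that group's rows.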

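If a country has several distinct top commodities, they are pasted together into a single string with a separator between them.

Another pattern with dplyr: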

df %>% arrange(CommodRank) %>%
    mutate(topCommod = Main_Commod[1])
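One thing to keep in mind with this pattern: it takes the best-ranked row of each group whether or not any row is flagged in topCommodInCountry, so it does not produce "unknown". A small sketch on made-up toy rows (the toy data frame is an assumption for illustration, not the original data):

```r
library(dplyr)

# Toy rows: Eritrea has no row flagged as top commodity
toy <- data.frame(
  country = c("Eritrea", "Ethiopia", "Ethiopia"),
  CommodRank = c(2L, 3L, 1L),
  topCommodInCountry = c(FALSE, FALSE, TRUE),
  Main_Commod = c("Copper", "Nickel", "Gold"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(country) %>%
  arrange(CommodRank) %>%              # sorts the whole frame by rank
  mutate(topCommod = Main_Commod[1]) %>%  # first row of each group after sorting
  ungroup()

# Ethiopia gets "Gold"; Eritrea gets "Copper" even though nothing is flagged
res
```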

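This is not an answer, but I learned a lot from @docendo discimus's answer. It took me a second to understand the negated if condition (!any(topCommodInCountry)), and I wondered whether it was just me, or whether my computer also needs a second to work that out :)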

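Using the same dataset, I tried the idea of writing the if-else with a positive condition instead. First, I checked that the two solutions return identical results: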

identical(
  #Negative
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)]), 
  #Positive
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else "unknown"))

[1] TRUE
Next, I benchmarked the two versions:

require(rbenchmark)

benchmark("Negative" = {
  df %>% 
    mutate(topCom = if(!any(topCommodInCountry)) "unknown" 
           else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
  df %>% 
    mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)] 
           else  "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
            "relative", "user.self", "sys.self"))
The difference is not that large, but I assume it would grow with a larger dataset.

      test replications elapsed relative user.self sys.self
1 Negative        10000   12.59    1.015     12.44        0
2 Positive        10000   12.41    1.000     12.30        0 

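Adding to @Ryan's comment: sorting the whole data frame (by group) is much slower than taking the maximum of a single column (by group).

Doing "Main_Commod[1]" can be very dangerous/wrong if you have not arranged your dataset correctly.

Thanks very much! Off the top of your head, is there an obvious way to break ties and label them, so that topCom would be assigned something like "Gold/Diamonds/…"? (Assuming there are 2 or more Main_Commods with CommodRank == 1.)

Use "which" instead of "which.max" to get all matching row indices; you can then index with them and paste the unique names together: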
df %>% mutate(topCom = Main_Commod[which(topCommodInCountry == max(topCommodInCountry))] %>% unique %>% paste(sep = '', collapse = '/'))
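A slightly more defensive variant of the tie-handling idea, as a sketch only. Note that when a group has no TRUE flag, max() of the flags is FALSE and every row matches the comparison, so all of the group's commodities would be pasted together. The hypothetical helper collapse_top() below (not from the thread) falls back to "unknown" in that case; the toy data is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: paste the distinct flagged values, or "unknown" if none
collapse_top <- function(flag, value, sep = "/") {
  if (!any(flag)) return("unknown")
  paste(unique(value[which(flag)]), collapse = sep)
}

# Toy rows: country A has a tie between two commodities, B has none flagged
toy <- data.frame(
  country = c("A", "A", "A", "B"),
  topCommodInCountry = c(TRUE, TRUE, TRUE, FALSE),
  Main_Commod = c("Gold", "Diamonds", "Gold", "Copper"),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(country) %>%
  mutate(topCom = collapse_top(topCommodInCountry, Main_Commod)) %>%
  ungroup()

# A rows get "Gold/Diamonds", the B row gets "unknown"
res
```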
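On my four-year-old laptop, running for (i in 1:1e6) !TRUE takes about a tenth of a second. Not worth worrying about. It may be worth removing an unnecessary ! purely for readability, but for what it's worth, I think the condition is quite intuitive if ! is read as "not", i.e. "if not any topCommodInCountry".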
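The timing mentioned in the comment above can be checked with a sketch along these lines, using only base R. The exact figure depends on the machine and R version, so no expected value is shown.

```r
# Time one million logical negations in a plain for loop
timing <- system.time(
  for (i in 1:1e6) !TRUE
)

# Elapsed wall-clock time in seconds
elapsed <- unname(timing["elapsed"])
print(elapsed)
```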