How to grep using mutate in dplyr


I'd like some help understanding what is going on in my dplyr pipe, and I'm asking for various solutions to work around this problem.

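Problem

I have a list of institutes (the formal term for where the authors of research journal articles come from), and I'd like to extract the name of the main institute. If it's a university, it will be University of XX; for simplicity's sake, that's the example I'm sticking to here.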

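Attempted solution logic

Split the institute name by commas

grep for "university" or other university-related terms

Extract the piece at the index where the hit was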

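Edge cases / assumptions

The term I'm searching for is present in only one of the splits
All institutes here are universities (to keep the problem simple for Stack Overflow)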
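The three steps above, applied to a single string, can be sketched roughly as follows. This is an illustration only, not the asker's original code: the variable names are made up, the input string is modeled on the first row shown in this thread, and the trimws call is an addition that removes the leading space left behind by the split.

```r
# Hypothetical single affiliation string, modeled on the first row in this thread
institute <- "school of computer science and engineering, university of new south wales, sydney, australia"

# Step 1: split on commas; strsplit returns a list, so take its first element
parts <- strsplit(institute, ",")[[1]]

# Step 2: find which pieces mention "univ"
hits <- grep("univ", parts)

# Step 3: keep the first matching piece (NA if nothing matched)
# and trim the space that followed the comma
inst_guess <- trimws(parts[hits][1])
print(inst_guess)
```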

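Code

What I assumed to be happening, but isn't, is the logic I've written above. What I'm seeing instead is that, in the mutate, the first instance of institute is being searched for every row of df, and the exact same " university of new so~" value is being populated. I have a rough idea of what the error is, but I don't know why it's happening or how to fix it while keeping it in dplyr. If I use an apply function I can get away with this, and I'm curious what the answer is.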

What it looks like:

# A tibble: 6 x 2
  institute                                                                          instGuess              
  <chr>                                                                              <chr>                  
1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
3 department of ece, pesit, bangalore, india                                         " university of new so~
4 school of information technology and electrical engineering, university of queens~ " university of new so~
5 school of information technology and electrical engineering, university of queens~ " university of new so~
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~
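One way to see why every row ends up with the same value: outside a grouped or row-wise context, mutate evaluates the expression once on the whole institute column. The base R sketch below reproduces that situation with a hypothetical three-row vector modeled on rows shown in this thread; the vector and all variable names are illustrative, not the asker's data.

```r
# Hypothetical three-row column, modeled on rows shown in this thread
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany",
  "department of ece, pesit, bangalore, india"
)

# strsplit() returns one character vector per row ...
pieces_by_row <- strsplit(institute, ",")

# ... but unlist() flattens them into a single vector, losing the row boundaries
all_pieces <- unlist(pieces_by_row)

# grep() now searches the pieces of every row at once, and [1] keeps only the first hit
first_hit <- all_pieces[grep("univ", all_pieces)][1]

# A length-one result is recycled to fill the whole column
print(data.frame(institute = substr(institute, 1, 30), instGuess = first_hit))
```

The grouped and row-wise approaches avoid this because the same expression then only ever sees the strings belonging to one group or one row.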


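Data used for the example: df

You need to include a group_by for this syntax to work, so that the expression is evaluated once per institute instead of once over the whole column: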
df %>%
  group_by(institute) %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1])

which produces:
# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                  instGuess
  <chr>                                                                      <chr>
1 school of computer science and engineering, university of new south wales… " university of new so…
2 department computer science, friedrich-alexander-university, erlangen-nur… " friedrich-alexander-…
3 department of ece, pesit, bangalore, india                                 NA                     
4 school of information technology and electrical engineering, university o… " university of queens…
5 school of information technology and electrical engineering, university o… " university of queens…
6 dept. of info. syst. and comp. sci., national university of singapore, 10… " national university …

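It looks like only the first element is being used. We can use rowwise to group by each row and ensure that the operation is specific to each row: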
library(dplyr)

df %>%
  rowwise() %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
  ungroup() %>%
  head()
# # A tibble: 6 x 2
# institute                                                              instGuess             
#   <chr>                                                                  <chr>                 
# 1 school of computer science and engineering, university of new south w~ " university of new s~
# 2 department computer science, friedrich-alexander-university, erlangen~ " friedrich-alexander~
# 3 department of ece, pesit, bangalore, india                             NA                    
# 4 school of information technology and electrical engineering, universi~ " university of queen~
# 5 school of information technology and electrical engineering, universi~ " university of queen~
# 6 dept. of info. syst. and comp. sci., national university of singapore~ " national university~
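The question mentions that an apply function also works. A minimal sketch of that idea follows, assuming a hypothetical helper named first_match (not part of dplyr and not taken from the answers here) that does the split and search per string with vapply, so the result already has one value per row. The input vector is illustrative, and unlike the output above the helper trims the leading space with trimws.

```r
# Hypothetical helper: for each string, return the first comma-separated
# piece containing `pattern`, or NA when no piece matches
first_match <- function(x, pattern = "univ") {
  vapply(
    strsplit(x, ","),
    function(parts) trimws(parts[grep(pattern, parts)][1]),
    character(1)
  )
}

# Hypothetical input, modeled on two rows shown in this thread
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india"
)

inst_guess <- first_match(institute)
print(inst_guess)
```

Because first_match returns one value per input string, it could be called inside a pipeline as mutate(instGuess = first_match(institute)) without rowwise() or group_by().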

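I think @Pdubbs's answer is the best. It uses group_by to mimic @www's answer using rowwise(), but the difference (a distinct advantage, in my opinion) is that when $institute is repeated, you gain efficiency by making the guess only once per institution. This takes it a step further by not redoing strsplit on each instance. I'll duplicate the first row: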
df <- df[c(1,1:6),]

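(with a message call inserted into find_univ to indicate the number of calls; not to be included in production), and then the sequence: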
df %>%
  group_by(institute) %>%
  mutate(instGuess = find_univ(institute)) %>%
  ungroup() %>%
  select(instGuess) # for display purposes only
# ******  <---- six calls on seven rows, benefit of group_by
# A tibble: 7 × 1
#                           instGuess
#                               <chr>
# 1     university of new south wales
# 2     university of new south wales
# 3    friedrich-alexander-university
# 4                              <NA>
# 5       university of queenslandqld
# 6       university of queenslandold
# 7  national university of singapore

You can use sub:

a=df %>%
     group_by(institute)%>%
     mutate(Instname=sub("(.*,\\s|)(.*unive.*?)(,|$).*|.*","\\2",institute))
> a
# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                                           Instname                   
  <chr>                                                                                               <chr>                      
1 school of computer science and engineering, university of new south wales, sydney, australia        university of new south wa~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, germany            friedrich-alexander-univer~
3 department of ece, pesit, bangalore, india                                                          ""                         
4 school of information technology and electrical engineering, university of queenslandqld, australia university of queenslandqld
5 school of information technology and electrical engineering, university of queenslandold, australia university of queenslandold
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, sin~ national university of sin~
> a$Instname
[1] "university of new south wales"    "friedrich-alexander-university"   ""                                
[4] "university of queenslandqld"      "university of queenslandold"      "national university of singapore"
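Two details of the sub approach may be worth spelling out: sub is vectorized over its input, so it can also be applied to the column without grouping, and rows without a match come back as "" rather than NA as in the other answers. The sketch below reuses the pattern from the code above on a hypothetical two-element vector and converts the empty strings to NA; the conversion step is an addition, not part of the original answer.

```r
# Hypothetical input, modeled on two rows shown in this thread
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india"
)

# Same pattern as in the answer above: capture the comma-separated piece
# containing "unive"; the trailing "|.*" branch handles strings with no match
pattern <- "(.*,\\s|)(.*unive.*?)(,|$).*|.*"

# sub() works element by element, so no grouping is needed here
inst_name <- sub(pattern, "\\2", institute)

# Strings without a match come back empty; turn those into NA
inst_name[inst_name == ""] <- NA
print(inst_name)
```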

The find_univ function used in the group_by answer above (the message call only signals each invocation and should be left out of production code):
find_univ <- function(x) {
  message('*', appendLF=FALSE)
  y <- strsplit(x[[1]], ',')[[1]]
  y[grep('univ', y)][1]
}
