R: Subset semicolon-delimited strings using a list of names


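This seems like a simple operation, but I am stuck and looking for pointers.

I have a data frame of authors and their associated publications. In the author column, an article often has several authors, given as a semicolon-delimited list. Here is a small subset: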

structure(list(author = c("Moscatelli, Adriana; Nishina, Adrienne", 
"Asangba, Abigail", "Stewart, Abigail", "Redmond-Sanogo, Adrienne; Lee, Ahlam", 
"Purnamasari, Agustina; Lee, Ahlam; Moscatelli, Adriana", 
"Nishina, Adrienne", "Lee, Ahlam", 
"Lee, Ahlam; Cloutier, Aimee", "Kleihauer, Jay; Stephens, Roy; Hart, William", 
"Foor, Ryan M.; Cano, Jamie"), pubtitle = c("AIP Conference Proceedings", 
"Journal of Case Studies in Accreditation and Assessment", "173rd Meeting of Acoustical Society of America", 
"Journal of Research in Gender Studies", "Journal of Research in Gender Studies", 
"Scientometrics", "Journal of Agricultural Education", "Journal of Agricultural Education", 
"Journal of Agricultural Education", "Journal of Agricultural Education"
)), class = c("rowwise_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-10L))
I have a second data frame that contains only author names. Here is a subset of those names, for reproducibility:

structure(list(author = c("Asangba, Abigail", "Stewart, Abigail", 
"Moscatelli, Adriana", "Nishina, Adrienne", "Redmond-Sanogo, Adrienne", 
"Purnamasari, Agustina", "Lee, Ahlam", "Aliyeva, Aida", "Belanger, Aimee", 
"Cloutier, Aimee")), row.names = c(NA, 10L), class = "data.frame")
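I am trying to use the second data frame to subset the original one, but the semicolon-delimited names are causing trouble.

I thought the following would get me there, but so far no luck. I also tried converting the delimited strings to vectors and matching those against the author list, but that only returns names that appear on their own; names that appear inside a delimited string are never matched.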

list_authors_female <- data %>% 
  select(author, pubtitle) %>% 
  filter(author %in% female_authors_all)

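Any suggestions? Thanks.

Create a regular expression pat of the form author1|author2|...|authorN and apply it to pubs (here pubs is the publications data frame and authors is the data frame of names). With this approach no splitting is needed.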

pat <- authors %>% 
  rowwise %>% 
  mutate(author = toString(author)) %>%
  ungroup %>%
  { paste(.$author, collapse = "|") }

pubs %>% filter(grepl(pat, author))
giving:

# A tibble: 8 x 2
  author                                 pubtitle                               
  <chr>                                  <chr>                                  
1 Moscatelli, Adriana; Nishina, Adrienne AIP Conference Proceedings             
2 Asangba, Abigail                       Journal of Case Studies in Accreditati~
3 Stewart, Abigail                       173rd Meeting of Acoustical Society of~
4 Redmond-Sanogo, Adrienne; Lee, Ahlam   Journal of Research in Gender Studies  
5 Purnamasari, Agustina; Lee, Ahlam; Mo~ Journal of Research in Gender Studies  
6 Nishina, Adrienne                      Scientometrics                         
7 Lee, Ahlam                             Journal of Agricultural Education      
8 Lee, Ahlam; Cloutier, Aimee            Journal of Agricultural Education  
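
One caveat with the regex approach: the names are used as patterns, so a "." matches any character, and a name can match as a substring of a longer name. The following is only a sketch of a stricter variant in base R, not part of the original answer. The helper escape_regex, the variable names, and the two near-miss strings are assumptions made up for illustration.

```r
# Hypothetical helper: escape regex metacharacters so each name is matched literally
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

wanted <- c("Lee, Ahlam", "Foor, Ryan M.")

# A name must start at the beginning of the string or right after "; ",
# and must end at the end of the string or right before ";"
pat_strict <- paste0(
  "(^|;\\s*)(", paste(escape_regex(wanted), collapse = "|"), ")(;|$)"
)

# The first two entries come from the question's data;
# the last two are hypothetical near-misses added for illustration
candidates <- c(
  "Lee, Ahlam; Cloutier, Aimee",
  "Foor, Ryan M.; Cano, Jamie",
  "Lee, Ahlamar",
  "Foor, Ryan Mx"
)

# Strict pattern: only exact author names match
strict <- grepl(pat_strict, candidates, perl = TRUE)

# Unescaped, unanchored pattern (as built with paste(collapse = "|")):
# also matches the near-misses
loose <- grepl(paste(wanted, collapse = "|"), candidates)
```

With the sample data in the question, both patterns select the same rows; the difference only matters when the name list contains regex metacharacters or names that are prefixes of other names.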

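We can use a tidyverse approach (here df1 is the publications data frame and df2 is the data frame of names). Split 'author' at the ";" delimiter into 'long' format, do an inner join, and then, grouping by the row-number column created beforehand, paste the 'author' elements back into a single string.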

library(tidyverse)
df1 %>%
  rownames_to_column('rn') %>% 
  separate_rows(author, sep=";\\s*") %>%
  inner_join(df2)%>% 
  group_by(rn, pubtitle) %>% 
  summarise(author = str_c(author, collapse = "; ")) %>%
  ungroup %>%
  select(names(df1))
# A tibble: 8 x 2
#  author                                                 pubtitle                                               
#  <chr>                                                  <chr>                                                  
#1 Moscatelli, Adriana; Nishina, Adrienne                 AIP Conference Proceedings                             
#2 Asangba, Abigail                                       Journal of Case Studies in Accreditation and Assessment
#3 Stewart, Abigail                                       173rd Meeting of Acoustical Society of America         
#4 Redmond-Sanogo, Adrienne; Lee, Ahlam                   Journal of Research in Gender Studies                  
#5 Purnamasari, Agustina; Lee, Ahlam; Moscatelli, Adriana Journal of Research in Gender Studies                  
#6 Nishina, Adrienne                                      Scientometrics                                         
#7 Lee, Ahlam                                             Journal of Agricultural Education                      
#8 Lee, Ahlam; Cloutier, Aimee                            Journal of Agricultural Education         
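
Note that joining and then pasting rebuilds the author string from the matched names only, so any co-author who is not in the name list would be dropped from the result. If the original string should be kept intact, one option is to split only for the purpose of the test. This is a minimal base R sketch under that assumption; the object names and the third row of the example data are hypothetical and not from the question.

```r
# Small example; the third row is a hypothetical mixed case
# (one listed author, one unlisted) added for illustration
pubs_small <- data.frame(
  author = c(
    "Lee, Ahlam; Cloutier, Aimee",
    "Kleihauer, Jay; Stephens, Roy; Hart, William",
    "Lee, Ahlam; Cano, Jamie"
  ),
  pubtitle = c(
    "Journal of Agricultural Education",
    "Journal of Agricultural Education",
    "Journal of Agricultural Education"
  ),
  stringsAsFactors = FALSE
)

wanted <- data.frame(
  author = c("Lee, Ahlam", "Cloutier, Aimee"),
  stringsAsFactors = FALSE
)

# Split each entry on ";" plus optional whitespace: one character vector per row
split_names <- strsplit(pubs_small$author, ";\\s*")

# Keep a row if at least one of its individual names is in the wanted list
keep <- vapply(split_names, function(x) any(x %in% wanted$author), logical(1))

# The author column is left untouched, so unlisted co-authors remain visible
result <- pubs_small[keep, , drop = FALSE]
```

Here the first and third rows are kept, and the third row still reads "Lee, Ahlam; Cano, Jamie" rather than being reduced to the matched name.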

If you are willing to use the tidyr package, it has some neat tools for splitting delimited lists, in particular separate and separate_rows.


library(dplyr)

data3 <- data %>%
  # The sample data is a rowwise_df; drop that grouping before reshaping
  ungroup() %>%
  # To keep the original author string, duplicate the column first
  mutate(author_sep = author) %>%
  # Give each delimited author its own row (tidy data);
  # ";\\s*" also consumes the space after each semicolon so names match exactly
  tidyr::separate_rows(author_sep, sep = ";\\s*") %>%
  # inner_join keeps only rows whose author appears in female_authors_all
  inner_join(female_authors_all, by = c("author_sep" = "author")) %>%
  # Remove the helper column we created
  select(-author_sep) %>%
  # Remove duplicate rows in case more than one author in the delimited list matched
  distinct()