Regex gsub: add a space before/after the & character

regex r

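First post on Stack Overflow, hopefully the first of many.

I am cleaning a dataset in which one column contains a list of authors. When there are multiple authors, they are separated by an ampersand, e.g. Smith & Banks. However, the spacing is not always consistent, e.g. Smith& Banks, Smith &Banks.

To fix this, I tried:

     gsub('\\S&','\\S &', dataset[,author.col])

This gives Smith& Banks -> SmitS & Banks. How can I get -> Smith & Banks?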
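The attempt above fails because `\\S` is only a character class in the pattern; in the replacement string it has no special meaning, so a literal "S" is written out. A minimal sketch of the effect, using a made-up test vector `x` (not part of the original data):

```r
x <- c("Smith& Banks", "Smith &Banks")

# In the pattern, \\S consumes the "h" in front of "&".
# In the replacement, \\S is just the letter "S", so the "h" is lost.
attempt <- gsub("\\S&", "\\S &", x)

# Capturing the character and reusing it via \\1 keeps it intact.
kept <- gsub("(\\S)&", "\\1 &", x)

print(attempt)
print(kept)
```

Only the first element changes in either call, because the second element already has a space in front of the ampersand.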

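Here is a solution that uses two calls to gsub:

# add a space after any non-whitespace character that directly precedes "&"
dataset[,author.col] <- gsub('(\\S)&', '\\1 &', dataset[,author.col])
# add a space before any non-whitespace character that directly follows "&"
dataset[,author.col] <- gsub('&(\\S)', '& \\1', dataset[,author.col])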
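To see the two-pass idea on its own, here is a small sketch that runs it on a made-up vector `authors` (an assumption for illustration, not the asker's data), using a capture group and a literal space in the replacement:

```r
authors <- c("Smith&Banks", "Smith& Banks", "Smith &Banks", "Smith & Banks")

# Pass 1: a non-whitespace character directly before "&" gets a space after it.
step1 <- gsub("(\\S)&", "\\1 &", authors)
# Pass 2: a non-whitespace character directly after "&" gets a space before it.
step2 <- gsub("&(\\S)", "& \\1", step1)

print(step2)
```

Note that this only inserts missing spaces; it does not collapse runs of several spaces around the ampersand.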

Here is an approach that uses only sub (v1 is the character vector of author strings):

sub("\\b(?=&)|(?<=&)\\b", " ",  v1, perl = TRUE)
#[1] "Smith & Banks" "Smith & Banks"
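The definition of `v1` is not shown above, so the following sketch assumes a vector in which each element is missing a space on exactly one side of the ampersand. It also shows, as an illustration, how `sub` and `gsub` differ when both spaces are missing:

```r
# Assumed input: one space is missing per element.
v1 <- c("Smith& Banks", "Smith &Banks")
one_sided <- sub("\\b(?=&)|(?<=&)\\b", " ", v1, perl = TRUE)

# When both spaces are missing, sub() stops after the first zero-width match,
# while gsub() inserts a space at every matching position.
both_missing <- "Smith&Banks"
sub_result  <- sub("\\b(?=&)|(?<=&)\\b", " ", both_missing, perl = TRUE)
gsub_result <- gsub("\\b(?=&)|(?<=&)\\b", " ", both_missing, perl = TRUE)

print(one_sided)
print(c(sub_result, gsub_result))
```

The lookarounds match an empty position next to the ampersand, which is why `perl = TRUE` is required.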

In essence, if there are many patterns, I find the strsplit approach works better.
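The strsplit code itself is not shown here, so the following is only a guess at what such an approach could look like: split on the ampersand together with any surrounding whitespace, then paste the pieces back together. The vector `authors` is made up for illustration.

```r
authors <- c("Smith&Banks", "Smith  & Banks &Nash", "Smith")

# Split each string on "&" plus any whitespace around it ...
parts <- strsplit(authors, "\\s*&\\s*")
# ... then rejoin the names of each element with a uniform " & ".
cleaned <- vapply(parts, paste, character(1), collapse = " & ")

print(cleaned)
```

Elements without an ampersand come back unchanged because they are split into a single piece.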
Here is another gsub approach:
# some test cases
authors <- c("Smith& Banks", "Smith   &Banks", "Smith&Banks", "Smith & Banks")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
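Applied to the column from the question, the same pattern could be used as sketched below. The data frame contents and the value of `author.col` are hypothetical stand-ins, since the real dataset is not shown.

```r
# Hypothetical stand-ins for the question's `dataset` and `author.col`.
dataset <- data.frame(id = 1:3,
                      author = c("Smith& Banks", "Smith   &Banks", "Smith"),
                      stringsAsFactors = FALSE)
author.col <- "author"

# Normalise the whitespace around every "&" and write the result back.
dataset[, author.col] <- gsub("\\s*&\\s*", " & ", dataset[, author.col])

print(dataset)
```

Because the pattern consumes the existing whitespace on both sides, it both inserts missing spaces and collapses repeated ones.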

An overkill approach using stringi:
v <- c("Smith & Banks", "Smith& Banks", "Smith &Banks", "Smith&Banks", "Smith Banks")

library(stringi)
# index of the entries that contain "&"
indx <- grepl("&", v)
# extract the words of those entries, rejoin them with " & ",
# and write them back in place so the original order of "v" is kept
v[indx] <- vapply(stri_extract_all_words(v[indx]),
                  paste, character(1), collapse = " & ")
v
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith Banks"

You could also try the following:
gsub("([^& ]+)\\W+([^&]+)","\\1 & \\2",authors)
#[1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"

This solution will add an & where there is none: "Smith Banks" -> "Smith & Banks"
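The behaviour described in this comment can be reproduced with a short sketch. The guard with `grepl` is one possible workaround, offered here as an assumption rather than as part of the original answer; the vector `v` is made up.

```r
v <- c("Smith&Banks", "Smith Banks")

# \\W+ matches any run of non-word characters, so a plain space is
# treated the same way as "&" and an ampersand is inserted.
unguarded <- gsub("([^& ]+)\\W+([^&]+)", "\\1 & \\2", v)

# Possible guard: only rewrite the elements that actually contain "&".
has_amp <- grepl("&", v, fixed = TRUE)
guarded <- v
guarded[has_amp] <- gsub("([^& ]+)\\W+([^&]+)", "\\1 & \\2", v[has_amp])

print(unguarded)
print(guarded)
```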
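@WiktorStribiż. I just re-read the question and noticed that there are also entries without an ampersand.
This is a good option. I thought you had deleted it, so I used a similar one with strsplit.
@akrun I originally followed Wiktor's comment, but ended up adjusting it.
If a name has multiple ampersands, your regex could have unintended consequences.
That depends on how you define a "name". Multiple ampersands between the same names within a string (like Smith&Banks) are problematic, I agree, although I do not understand the question well enough on that point, whereas multiple ampersands between different names in a vector element (like "Smith&Banks&Nash") will not cause any problem. Of course, some of the other answers do not handle multiple ampersands correctly either, but I would first like the OP to clarify whether this can happen.
Thanks. There are multiple ampersands in many rows, but at most one ampersand separates two names.
Are there cases such as Smith&&Banks, i.e. multiple ampersands between the same authors?
I don't have those cases; the only separator between different names is the ampersand.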
# the same pattern with several ampersands in one element and an element without any
authors <- c("Smith& Banks", "Smith   &Banks &Nash", "Smith&Banks", "Smith & Banks", "Smith")
gsub("\\s*&\\s*", " & ", authors)
#[1] "Smith & Banks"        "Smith & Banks & Nash" "Smith & Banks"        "Smith & Banks"        "Smith"

data = c("Smith& Banks", "Smith &Banks", "Smith & Banks", 
         "Smith &     Banks", "Smith&Banks")

# Match zero or more spaces on either side of the ampersand and replace the whole match with " & "
gsub("[ ]*&[ ]*", " & ", data) 
# [1] "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks" "Smith & Banks"
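One difference between this pattern and the `\\s*` variant is worth a quick check: `[ ]*` only matches the space character, while `\\s*` also matches tabs and other whitespace. A small sketch with a made-up tab-separated string:

```r
x <- "Smith\t&\tBanks"   # tab characters around the ampersand

# [ ]* matches spaces only, so the tabs stay where they are.
spaces_only <- gsub("[ ]*&[ ]*", " & ", x)
# \\s* matches any whitespace, so the tabs are consumed as well.
any_space <- gsub("\\s*&\\s*", " & ", x)

print(c(spaces_only, any_space))
```

Which of the two is preferable depends on whether the column can contain whitespace other than plain spaces.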
