Regex: extracting the text between two commas in R (Regex, R, Stringr)

regex r

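I have a bunch of text in a data frame (df), usually with a three-line address held in a single column, and my goal is to extract the district (the middle part of the text), for example:

Fortunately, in 95% of cases the people who entered the data used commas to separate the text I want, and 100% of the time the address ends with ", London" (that is, comma, space, London). To state the problem clearly: my goal is to extract the text before ", London" and after the preceding comma.

My desired output is: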

Wandsworth
Lambeth

I can extract part of the string with the following, which removes everything up to and including the last comma:

df$extraction <- sub('.*,\\s*','',address)

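You can spare yourself the regex trouble by treating the vector as CSV and using a file-reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns: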
address <- c(
    "73 Greenhill Gardens, Wandsworth, London", 
    "22 Acacia Heights, Lambeth, London"
)

read.csv(text = address, colClasses = c("NULL", "character", "NULL"), 
    header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"   
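
Since the question works with a data frame, here is a minimal sketch of writing the read.csv() result straight into a new column. The data frame and the column names `address` and `district` are assumptions made for illustration, and the approach assumes every row has exactly three comma-separated fields.

```r
# Hypothetical data frame; the column name `address` is an assumption.
df <- data.frame(
  address = c(
    "73 Greenhill Gardens, Wandsworth, London",
    "22 Acacia Heights, Lambeth, London"
  ),
  stringsAsFactors = FALSE
)

# Parse each address as one CSV row and keep only the second field:
# "NULL" in colClasses drops a column, "character" keeps it as text.
df$district <- read.csv(
  text = df$address,
  colClasses = c("NULL", "character", "NULL"),
  header = FALSE,
  strip.white = TRUE
)[[1L]]

df$district
```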

Here are a couple of approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth" 
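
To make the alternation easier to follow, the same result can be reached with two separate substitutions. This is only a sketch: the variable names `step1` and `step2` are illustrative, and the `$` anchor is an addition that assumes ", London" always ends the string, as the question states.

```r
address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London"
)

# Step 1: lazily drop everything up to and including the first ", ".
step1 <- sub("^.+?, ", "", address, perl = TRUE)

# Step 2: drop the trailing ", London" (anchored to the end of the string).
step2 <- sub(", London$", "", step1)

step2
```

Splitting the pattern this way makes it easier to inspect the intermediate value when an address does not come out as expected.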

Or:

# target the whole string, but use a capture group 
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth" 

You can try this:

(?<=, )(.+?),

This does work, but it leaves a comma at the end of each string.

The CSV idea also works with data.table::fread(), selecting only the second column:

data.table::fread(paste(address, collapse = "\n"), 
    select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth" 

Here are two options that do not depend on the city name being the same. The first uses a regex pattern with stringr::str_extract(); the second splits each string with strsplit(). The split is done on ,\\s* in case there is no space, or more than one space, after the comma. This should work 95% of the time; the other 5% are errors on the data provider's side.

raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London", 
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

df <- data.frame(raw_address, stringsAsFactors = FALSE)

df$distict = stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')

> df
                               raw_address     distict
1 73 Greenhill Gardens, Wandsworth, London  Wandsworth
2       22 Acacia Heights, Lambeth, London     Lambeth
3                   Street, District, City    District
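
One detail worth checking: the lookbehind `(?<=,)` starts the match immediately after the comma, so the space that follows the comma is part of the match. The sketch below reproduces the same pattern in base R with `regexpr()` and `perl = TRUE` (a substitution for stringr, used here only to avoid a package dependency) and trims the result with `trimws()`. Note that `regmatches()` silently drops elements that do not match, so it is only a drop-in replacement when every string matches.

```r
raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

# Same lookaround pattern as above, evaluated by PCRE in base R.
m <- regmatches(
  raw_address,
  regexpr("(?<=,)[^,]+(?=,)", raw_address, perl = TRUE)
)

m          # each match still carries the space after the comma
trimws(m)  # leading/trailing whitespace removed
```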

df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1) 
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)

> df
                               raw_address              address    distict   city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2       22 Acacia Heights, Lambeth, London    22 Acacia Heights    Lambeth London
3                   Street, District, City               Street   District   City
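
For the roughly 5% of rows that are not comma-separated, indexing the split result by position can return the wrong piece or NA. As a hedged sketch, the split can be done once and the second-to-last piece taken only when at least three pieces exist. The third input string and the names `parts` and `district` are made up for illustration.

```r
raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "No commas here"  # hypothetical malformed row
)

# Split once on a comma followed by any amount of whitespace.
parts <- strsplit(raw_address, ",\\s*")

# Take the piece before the last one; return NA when there are
# fewer than three pieces, so malformed rows are easy to find.
district <- vapply(
  parts,
  function(p) if (length(p) >= 3) p[length(p) - 1] else NA_character_,
  character(1)
)

district
```

Rows that come back as NA can then be reviewed by hand instead of being filled with a wrong value.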