Regex: extracting the text between two commas in R
I have a bunch of text in a data frame (df) that usually contains a three-line address in a single column, and my goal is to extract the district (the middle part of the text).
Fortunately, in 95% of cases the person entering the data used commas to separate the text I want, and 100% of the time the string ends with ", London" (that is, comma, space, London). To state the problem clearly: my goal is to extract the text that comes before ", London" and after the preceding comma.
My desired output is:
Wandsworth
Lambeth
I can already pull out one part of the string with the following, which strips everything up to the last comma:
df$extraction <- sub('.*,\\s*','',address)
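df$extraction
You can spare yourself the trouble of a regex by treating the vector as CSV and using a file-reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.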
address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London"
)
read.csv(text = address, colClasses = c("NULL", "character", "NULL"),
header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Here are a few approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"
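Or you can try this pattern:
(?<=, )(.+?),
This does work, but it leaves a comma at the end of each string.
Here are two options that do not depend on the city name always being the same. The first option uses a regex pattern with stringr::str_extract(). The second splits each string with strsplit(); the split is done on ,\\s* in case there is no space, or more than one space, after the comma. This should work 95% of the time. The other 5% is the data provider's error.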
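One way to avoid the trailing comma left by the (?<=, )(.+?), pattern is to move the closing comma into a lookahead, so that it is checked but not included in the match. The following is only a sketch in base R, assuming the same two sample addresses used elsewhere on this page; the variable name district is made up for illustration.

```r
address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London"
)

# (?<=, ) requires ", " immediately before the match,
# (?=,)   requires "," immediately after it;
# neither comma becomes part of the extracted text.
# perl = TRUE is needed for lookaround support in base R.
district <- regmatches(address, regexpr("(?<=, ).+?(?=,)", address, perl = TRUE))
district
# [1] "Wandsworth" "Lambeth"
```

Note that regmatches() silently drops elements that have no match, so the result can be shorter than the input if some addresses contain fewer than two commas.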
data.table::fread(paste(address, collapse = "\n"),
select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
# target the whole string, but use a capture group
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
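If the last field is not always "London", the same capture-group idea can be written without the literal city name by anchoring on the last two commas instead. This is a sketch under that assumption, not part of the original answers; the variable name district is hypothetical.

```r
address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London"
)

# ^.*,\\s*      everything up to the second-to-last comma (plus spaces)
# ([^,]+)       the field we want: a run of non-comma characters
# ,\\s*[^,]+$   the last comma and the final field, whatever it is
district <- sub("^.*,\\s*([^,]+),\\s*[^,]+$", "\\1", address, perl = TRUE)
district
# [1] "Wandsworth" "Lambeth"
```

Strings that do not contain at least two commas do not match the pattern, and sub() returns them unchanged, so those rows still need a separate check.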
raw_address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London",
"Street, District, City"
)
df <- data.frame(raw_address, stringsAsFactors = FALSE)
# the lookbehind only consumes the comma, so trim the leading space from the match
df$distict <- trimws(stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)'))
> df
                               raw_address    distict
1 73 Greenhill Gardens, Wandsworth, London Wandsworth
2       22 Acacia Heights, Lambeth, London    Lambeth
3                   Street, District, City   District
df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1)
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)
> df
                               raw_address              address    distict   city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2       22 Acacia Heights, Lambeth, London    22 Acacia Heights    Lambeth London
3                   Street, District, City               Street   District   City
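For the roughly 5% of rows that were not entered with commas, it may help to return NA rather than a wrong field. Below is a minimal sketch built on the same strsplit() idea; the helper name extract_district and the rule "at least three fields, take the second-to-last" are assumptions made for illustration, not something stated in the answers above.

```r
# Hypothetical helper: returns the second-to-last comma-separated field,
# or NA when the string has fewer than three fields.
extract_district <- function(x) {
  parts <- strsplit(x, ",\\s*")
  vapply(
    parts,
    function(p) if (length(p) >= 3) p[length(p) - 1] else NA_character_,
    character(1)
  )
}

raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "1 Example Road Lambeth London"  # made-up malformed row without commas
)

extract_district(raw_address)
# [1] "Wandsworth" "Lambeth"    NA
```

Rows that come back as NA can then be reviewed by hand or handled with a different rule.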