Regex: extracting the text between two commas in R with a regular expression (tags: regex, r, stringr) - Fatal编程技术网

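I have a bunch of text in a data frame (df), usually a three-line address held in a single column, and my goal is to extract the district, which is the middle part of the text.

Fortunately, in 95% of cases the people who entered the data used commas to separate the text I want, and 100% of the time the address ends with ", London" (that is, comma, space, London). To state the problem clearly: I want the text that comes before ", London" and after the preceding comma.

My expected output is: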

Wandsworth
Lambeth
So far I can only extract part of it, using:

df$extraction <- sub('.*,\\s*','',address)

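You can spare yourself the trouble of a regex by treating the vector as CSV and using a file-reading function to pull out the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns (a class of "NULL" skips that column).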

address <- c(
    "73 Greenhill Gardens, Wandsworth, London", 
    "22 Acacia Heights, Lambeth, London"
)

read.csv(text = address, colClasses = c("NULL", "character", "NULL"), 
    header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"   
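
If the other fields are worth keeping too, the same read.csv() call can name all three columns instead of dropping two of them. The following is a minimal sketch, not part of the original answer: the data frame `df` and the column names `street`, `district` and `city` are assumptions made for illustration, and it assumes every row has exactly three comma-separated fields.

```r
# Hypothetical data frame mirroring the setup described in the question
df <- data.frame(
  address = c("73 Greenhill Gardens, Wandsworth, London",
              "22 Acacia Heights, Lambeth, London"),
  stringsAsFactors = FALSE
)

# Parse each address as one CSV record; strip.white = TRUE removes the
# space that follows each comma. The column names are illustrative only.
parts <- read.csv(text = df$address, header = FALSE, strip.white = TRUE,
                  col.names = c("street", "district", "city"),
                  stringsAsFactors = FALSE)

# Write the middle field back to the original data frame
df$extraction <- parts$district
df$extraction
# [1] "Wandsworth" "Lambeth"
```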

# the same idea with data.table::fread(): select = 2 keeps only the second field
data.table::fread(paste(address, collapse = "\n"), 
    select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth" 

Here are a few approaches:

# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth" 

# target the whole string, but use a capture group 
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth" 

You can try this:

(?<=, )(.+?),

This should work 95% of the time. The other 5% is the data provider's mistake.

This does work, but it leaves a comma at the end of each string. That is because the final comma is part of the match; turning it into a lookahead, (?<=, )(.+?)(?=,), would keep it out of the result.

Here are two options that do not depend on the city name always being the same. The first uses a regex pattern with stringr::str_extract():

raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London", 
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

df <- data.frame(raw_address, stringsAsFactors = FALSE)

# note: the match keeps the space that follows the comma; trimws() would remove it
df$distict = stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')

> df
                               raw_address     distict
1 73 Greenhill Gardens, Wandsworth, London  Wandsworth
2       22 Acacia Heights, Lambeth, London     Lambeth
3                   Street, District, City    District

The second splits each string with strsplit() and picks the pieces by position. The split is done on ,\\s* in case there is no space, or more than one space, after the comma.

df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1) 
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)

> df
                               raw_address              address    distict   city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2       22 Acacia Heights, Lambeth, London    22 Acacia Heights    Lambeth London
3                   Street, District, City               Street   District   City
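
For the rows where the comma separators are missing, none of the patterns above has an obvious right answer, so one option is to return NA rather than a wrong value. The sketch below is an illustration, not taken from the answers: the function name `extract_district` and the third test address are hypothetical, and it assumes every usable address ends in ", London" as the question states.

```r
# Hypothetical helper: return the field immediately before the trailing
# ", London", or NA when the string does not have that shape.
extract_district <- function(x) {
  # ^.*,        everything up to the comma that precedes the district
  # ([^,]+?)    the district itself (no commas allowed inside it)
  # ,\s*London  the fixed ending described in the question
  pattern <- "^.*,\\s*([^,]+?)\\s*,\\s*London\\s*$"
  hit <- grepl(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

addresses <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "5 High Street Camden, London"  # made-up row with a missing comma
)

extract_district(addresses)
# [1] "Wandsworth" "Lambeth"    NA
```

Rows that come back as NA can then be reviewed by hand instead of silently receiving part of the street name.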