Is there a general way to remove a substring that starts with a digit and ends with an uppercase letter in R?


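It is hard to describe, but basically I am trying to find a general method that can turn this:

[1] "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…"
[2] "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"
into this:

[1] "95 E Kennedy Blvd"
[2] "231 3rd St"
using R. I know it involves regular expressions, but I am not as fluent in them as I would like to be.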


Thanks

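Your expected output does not follow a very reliable rule, but judging from your sample data, you can get what you are after with this regex:

^.*?(\d{2,}.*?[a-z])[A-Z].*
and replace the match with
\1
since group 1 captures the text you want.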

sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…")
sub("^.*?(\\d{2,}.*?[a-z])[A-Z].*", "\\1", "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")

which prints, as you would expect:

[1] "95 E Kennedy Blvd"
[1] "231 3rd St"
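Since sub() is vectorised, the same pattern can be applied to all strings in one call. A minimal sketch, assuming the inputs are stored in a character vector; the names x, pattern and addresses are my own, and the review text of the first string is shortened here:

```r
# Sample strings from the question (first one shortened for readability).
x <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people.",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"
)

# Group 1: two or more digits, then as little as possible up to a lowercase
# letter that is immediately followed by an uppercase letter.
pattern <- "^.*?(\\d{2,}.*?[a-z])[A-Z].*"

# sub() works element-wise on the whole vector.
addresses <- sub(pattern, "\\1", x, perl = TRUE)
print(addresses)

# When nothing matches, sub() returns the input unchanged.
unchanged <- sub(pattern, "\\1", "no address here", perl = TRUE)
print(unchanged)
```

Because an unmatched string comes back untouched, it may be worth checking the result against the input before trusting it as an address.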
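Edit: OK, \d{2,} may be somewhat data-dependent, so here is a different logic. The capture now starts with one or more digits, \d+, which must be followed by one or more whitespace characters, and since the match should stop right before Lakewood, it also uses a positive lookahead, (?=Lakewood). The updated, better regex is: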

^.*?(\d+\s+.*?)(?=Lakewood).*

# perl = TRUE is needed: the default regex engine does not support lookaheads
sub("^.*?(\\d+\\s+.*?)(?=Lakewood).*", "\\1", "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…", perl = TRUE)
sub("^.*?(\\d+\\s+.*?)(?=Lakewood).*", "\\1", "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online", perl = TRUE)
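For what it is worth, the lookahead can also be used to extract the match directly in base R, without a replacement string. A minimal sketch, assuming the sample strings sit in a vector; the names x, m and addresses are my own, and the first review text is shortened:

```r
# Sample strings from the question (first one shortened for readability).
x <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people.",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"
)

# regexpr() locates the first match in each string; perl = TRUE enables (?=...).
m <- regexpr("\\d+\\s+.*?(?=Lakewood)", x, perl = TRUE)

# regmatches() returns the matched text and silently drops strings with no match.
addresses <- regmatches(x, m)
print(addresses)
```

Note that the result can be shorter than the input when some strings do not match, so it does not always line up with x position by position.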
Now, if you like, you can even use str_match with the regex \d+\s+.*?(?=Lakewood) to extract the text, using the following lines of code:

library(stringr)

str_match("On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…", "\\d+\\s+.*?(?=Lakewood)")
str_match("Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online", "\\d+\\s+.*?(?=Lakewood)")
which prints

     [,1]               
[1,] "95 E Kennedy Blvd"
     [,1]        
[1,] "231 3rd St"
The answer above is good and fairly general. However, in case you find it helpful, here is an alternative approach:

x <- c(" On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people. Everyone was pleased with their appetizers and main courses. We’ll be back for sure…",
       " Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online")
street_types <- c("Blvd", "St")
address_pattern <- paste("\\d+ .+?", street_types, collapse = "|")
stringr::str_extract_all(string = x, pattern = address_pattern, simplify = TRUE)
#      [,1]               
# [1,] "95 E Kennedy Blvd"
# [2,] "231 3rd St" 
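To make it clearer what the paste() call above builds, here is a small base R sketch that prints the resulting pattern and tries an equivalent form with the street types grouped into one alternation. The grouped pattern and the names grouped_pattern and addresses are my own illustration, not part of the answer, and the first review text is shortened:

```r
# Sample strings from the question (first one shortened for readability).
x <- c(
  "On The Grill(1)95 E Kennedy BlvdLakewood, NJ 08701(732) 942-6555Restaurants I had a business dinner at this restaurant with 5 other people.",
  "Sushi Now231 3rd StLakewood, NJ 08701(732) 719-2275RestaurantsSushi BarsWebsiteMenuOrder Online"
)
street_types <- c("Blvd", "St")

# paste() joins its arguments with a space by default, then collapses with "|".
address_pattern <- paste("\\d+ .+?", street_types, collapse = "|")
cat(address_pattern, "\n")

# Assumed equivalent: one prefix followed by a non-capturing group of street types.
grouped_pattern <- paste0("\\d+ .+? (?:", paste(street_types, collapse = "|"), ")")

# Base R extraction of the first match in each string.
addresses <- regmatches(x, regexpr(grouped_pattern, x, perl = TRUE))
print(addresses)
```

Either way, the approach only finds addresses whose street type is listed in street_types, and a short entry such as St would also match the beginning of a longer word like Street.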

This approach achieves it nicely:

(\[\d])(?:.+[^\s\d])((?:\d+\s+)[^\R]+)


Geshmak

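Whatever approach you expect, what stops it from matching 2275Restaurants?
@AhmedAbdelhameed I guess it is the space that comes with the address number.
The OP most likely wants the number to be at least two digits.
@AhmedAbdelhameed Why? Addresses (at least in my experience) do not usually start with a single digit.
@duckmayr Because that appears to be the building-number part of the address, which, AFAIK, does not have to be at least two digits long.
@AhmedAbdelhameed: As the expected output does not include (1), I had to use \d{2,}. Otherwise, suggest a logic that achieves that output and I will use your logic.
I gave it a try with one. Is it possible to remove the digits inside the parentheses?
Seems like a nice ad hoc approach! I will definitely try this.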