Extract a number from the middle or end of a string in R


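I have a character vector. I want to extract the number that follows "# of Stalls:"; that number sits either in the middle or at the end of each string.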

x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")

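Starting at the beginning of the string (^), we match one or more characters that are not # ([^#]+), followed by a #, followed by zero or more non-digit characters ([^0-9]*), followed by one or more digits captured as a group, ([0-9]+), followed by any remaining characters (.*), and we replace the whole match with the backreference to the captured group (\\1).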

as.integer(sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", x))
#[1] 244  40

If the label is always the same, we can spell it out in the pattern:

as.integer(sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", x))
#[1] 244  40
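One caveat with the sub() approach: when the pattern does not match, sub() returns the input unchanged, so as.integer() ends up coercing the whole string to NA with a warning. A minimal sketch of a guard follows; the helper name extract_stalls and the demo vector are assumptions made for illustration and are not part of the original answer.

```r
# Hypothetical helper (name chosen for illustration only).
# Returns NA_integer_ for strings without a "# of Stalls:" field instead of
# letting as.integer() coerce an unmodified string.
extract_stalls <- function(s) {
  pattern <- "^.*# of Stalls:\\s*([0-9]+).*$"
  matched <- grepl(pattern, s)
  out <- rep(NA_integer_, length(s))
  # Only run the substitution on the strings that actually match
  out[matched] <- as.integer(sub(pattern, "\\1", s[matched]))
  out
}

# Shortened demo strings; the third one has no stall count
demo <- c("Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free",
          "Operator: Caltrans<br/># of Stalls: 40",
          "Owner: Church<br/>Cost: Free")
extract_stalls(demo)
# [1] 244  40  NA
```

Checking with grepl() first keeps the result the same length as the input, which matters if the numbers are later attached to a data frame.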

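There are many ways to solve this; I will use the stringr package. The first str_extract returns [1] "# of Stalls: 244" "# of Stalls: 40", and the second str_extract then pulls out the digits, which are the only numeric part of that substring.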

x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")

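However, it is not clear to me whether you want to extract the number or replace part of the string. If you need to extract, the code below will work for you. If you want to replace, you need to use str_replace.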

library(stringr)
as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))

If you want to replace within the string, do the following:

str_replace(x,"#\\D*(\\d{1,})","\\1")

Output:

 > as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
    [1] 244  40

> str_replace(x,"#\\D*(\\d{1,})","\\1")
[1] "1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/>244<br/>Cost: Free"    
[2] "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/>40"
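For comparison, the same extract-rather-than-replace idea can be sketched in base R with regexpr() and regmatches(). This is an alternative to the stringr code above, not what the answer itself used; it assumes the label is always exactly "# of Stalls: " with a single space, because the lookbehind has to be fixed-length. The x_demo vector is a shortened stand-in for x.

```r
# Shortened stand-in for the question's vector x
x_demo <- c("Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free",
            "Operator: Caltrans<br/># of Stalls: 40")

# perl = TRUE enables the lookbehind (?<=...), so only the digits are matched
m <- regexpr("(?<=# of Stalls: )[0-9]+", x_demo, perl = TRUE)

# regmatches() returns the matched substrings; non-matching elements are dropped
stalls <- as.integer(regmatches(x_demo, m))
stalls
# [1] 244  40
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than the input when some strings lack the field.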


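Since this is HTML, you can use rvest or another HTML parser to extract the desired node first, which makes pulling out the number trivial. XPath selectors and functions give a bit more flexibility for this kind of work than CSS selectors do.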

library(rvest)

x %>% paste(collapse = '<br/>') %>%
    read_html() %>%
    html_nodes(xpath = '//text()[contains(., "# of Stalls:")]') %>%
    html_text() %>%
    readr::parse_number()
#> [1] 244  40

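Here are some solutions. (1) and (1a) are variations of the code in the question. (2) and (2a) take the opposite approach: rather than removing what we don't want, they match what we do want.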

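1) gsub. The code in the question removes the portion before the number but not the portion after it. We can modify it as below to do both. The |\\D.*$ part that we added does that. Note that "\\D" matches any non-digit.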

as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))
## [1] 244  40

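1a) sub. Alternatively, perform the two removals in separate sub calls. The inner sub is from the question, and the outer sub removes everything from the first non-digit after the number to the end of the string.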

as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))
## [1] 244  40

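2) strcapture. With this approach, which was available in the development version of R when this was written (strcapture has been part of the utils package since R 3.4.0), we can simplify the regular expression considerably. We specify the match using a capture group (the part in parentheses). strcapture returns the parts corresponding to the capture groups and builds a data.frame from them. The third argument is a prototype structure, which tells it to return integers. Note that "\\d" matches any digit.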

strcapture("# of Stalls: (\\d+)", xx, list(stalls = integer()))
##   stalls
## 1    244
## 2     40
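The prototype argument becomes more useful with several capture groups, since each group gets its own typed column. As a sketch, the pattern below pulls two numeric fields at once; the geo vector, the column names, and the pattern are assumptions made for illustration (shortened from the strings in the question), and R 3.4.0 or later is assumed.

```r
# Shortened demo strings modelled on the question's data
geo <- c("Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244",
         "Latitude: -117.859972<br/>Longitude: 34.017513<br/># of Stalls: 40")

# Two capture groups, so the prototype has two columns; numeric() makes
# strcapture convert the captured text to doubles
coords <- strcapture(
  "Latitude:\\s*(-?[0-9.]+)<br/>Longitude:\\s*(-?[0-9.]+)",
  geo,
  data.frame(latitude = numeric(), longitude = numeric())
)
coords
```

The column names come from the prototype, so the result is a data.frame with one row per input string and one column per capture group.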

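2a) strapply. The strapply function in the gsubfn package is similar to strcapture but uses an apply paradigm: the first argument is the input string, the second is the pattern, and the third is the function to apply to the capture group.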

library(gsubfn)

strapply(xx, "# of Stalls: (\\d+)", as.integer, simplify = TRUE)
## [1] 244  40

Note: The input xx used here is the same as x in the question:

xx <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", 
"20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40"
)

xx