R: Is there a way to strip out observations with a known format?
I am working with a database that contains more than 40 variables. Each case has a unique identifier for its property. Some of these identifiers have been entered into the address variables. The identifiers can only take the following formats:
NA123456 - First letter constant - N, 1 Letter A-K, Numbers 1-9
SA123456 - First 2 letters constant - SA, 6 Numbers 0-9
MABC1234 - First letter constant - M, 3 Letters A-Z, 4 Numbers 0-9
QABC1234 - First letter constant - Q, 3 Letters A-Z, 4 Numbers 0-9
WABC1234 - First letter constant - W, 3 Letters A-Z, 4 Numbers 1-9
TABC1234 - First letter constant - T, 3 Letters A-Z, 4 Numbers 1-9
3ABCD123 - First number constant - 3, 4 Letters A-Z, 3 Numbers 1-9
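I am not sure how to strip the unique identifiers out of the address text without creating a lookup table and using left_join. A lookup table would need constant updating, which makes it very cumbersome.
I have not found an example of this, though I may have missed something.
My data looks like this: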
Property Address `Aa reference`
<chr> <chr> <lgl>
1 PIC: 3WABG086 260 SPRINGHURST ROAD NA
2 PIC: 35PSR217 1350 RIVER ROAD NA
3 PIC# NH244157 1038 QUONDONG ROAD NA
4 PIC: 3GMUF425 70 DIGBY ROAD NA
5 PIC# 3GMUF425 70 DIGBY ROAD NA
6 PIC QTIWW0626 REMOLEA NA
7 PIC#EBWSE235 BOX 191 NA
8 PIC #3WLKM019 198 MONTGOMERY ROAD NA
9 PIC # 3BWMM021 149 ANDERSONS ROAD NA
10 PIC: 3WCGN034 WERRIBEE NA
11 GARANGULA PIC: NH630488 PO BOX 84 NA
12 GARANGULA PIC: NH630488 PO BOX 84 NA
13 PIC: 3GMTL320 2980 GLENELG HIGHWAY NA
14 GREENSLOPES PIC: MJKE0261 914 WEST KENTISH ROAD NA
15 PIC: WFZB3246 859 PFEIFFER ROAD NA
16 PIC: WFAY3549 34605 ALBANY HIGHWAY NA
17 PIC: 3CEXK044 2244 LAVERS HILL ROAD NA
18 PIC: QGWW0462 ELDERFIELD NA
19 PIC: 3WCGN034 WERRIBEE NA
20 KAYA DORPER & WHITE DORPER STUD PIC: WABN0262 NA
21 SPOTTSWOOD PIC QKDR0078 NA
22 COOMBOONA HOLSTEINS PIC 3SPSR217 NA
23 ROSEVALE PIC: QKEV0169 NA
24 NA PIC 3EGON009 NA
25 NA PIC WFKPO316 NA
26 IVADENE PIC 3WANP0T1 NA
27 NA PIC ND225813 NA
28 HEAVENLY VALLEY FARMS PIC #NF538645 NA
29 C/- CED WISE AB CENTRE PIC: QCST0158 NA
30 GARANGULA PIC # NH630488 NA
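The clean data would end up with the unique identifiers in the Aa reference column, and would not overwrite with NA any observation where the data is already in the correct variable.
Any help is much appreciated.

A possible answer, using regex patterns and stringr::str_extract_all().
I think your numbers should be 0-9 rather than 1-9. If not, change every [0-9] to [1-9].
Also, if you want to match a specific number (say n) of repeated letters/digits, change + to {n}, as in the first pattern in vec.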
library( data.table )
library( stringr )
# NA123456 - First letter constant - N, Letter A-K, Numbers 1-9
# SA123456 - First 2 letters constant - SA, Numbers 1-9
# MABC1234 - First letter constant - M, Letters A-Z, Numbers 1-9
# QABC1234 - First letter constant - Q, Letters A-Z, Numbers 1-9
# WABC1234 - First letter constant - W, Letters A-Z, Numbers 1-9
# TABC1234 - First letter constant - T, Letters A-Z, Numbers 1-9
# 3ABCD123 - First number constant - 3, Letters A-Z, Numbers 1-9
#create a vector with all regex-patterns
#I assumed 1-9 should be 0-9 ?? <-- !!
vec <- c( "N[A-K]{1}[0-9]+",
"SA[0-9]+",
"M[A-Z]+[0-9]+",
"Q[A-Z]+[0-9]+",
"W[A-Z]+[0-9]+",
"T[A-Z]+[0-9]+",
"3[A-Z]+[0-9]+" )
#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )
#extract all matches from the column 'Address' and store them (as a list column) in Aa_reference
#(use Property instead of Address to search the other column)
DT[, Aa_reference := stringr::str_extract_all( Address, pattern )]
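
As a rough sketch of the {n} note above: the patterns below use exact repetition counts read off the format list in the question, and wrap the alternatives in word boundaries (\\b) so that a match cannot start in the middle of a longer token. The names loose_vec, strict_vec and x are made up for illustration, and digits are assumed to be 0-9.

```r
library(stringr)

# Loose patterns, as in the answer above
loose_vec <- c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+", "Q[A-Z]+[0-9]+",
               "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+", "3[A-Z]+[0-9]+")
loose_pattern <- paste(loose_vec, collapse = "|")

# Exact-length variants (assumption: counts follow the question's format list)
strict_vec <- c("N[A-K][0-9]{6}",          # N + 1 letter A-K + 6 digits
                "SA[0-9]{6}",              # SA + 6 digits
                "[MQWT][A-Z]{3}[0-9]{4}",  # M/Q/W/T + 3 letters + 4 digits
                "3[A-Z]{4}[0-9]{3}")       # 3 + 4 letters + 3 digits
# \\b on both sides: the ID must be a whole token
strict_pattern <- paste0("\\b(?:", paste(strict_vec, collapse = "|"), ")\\b")

x <- c("PIC: 3WABG086", "PIC#EBWSE235", "PIC QTIWW0626", "GARANGULA PIC # NH630488")

# str_extract() returns the first match per string, or NA if there is none
loose  <- str_extract(x, loose_pattern)
strict <- str_extract(x, strict_pattern)

print(data.frame(x, loose, strict))
```

With the loose patterns, "EBWSE235" yields the partial match "WSE235" and the nine-character "QTIWW0626" is accepted. The strict version returns NA for both, which may be preferable if malformed IDs should be reviewed by hand rather than silently accepted.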
This is what finally worked:
library(dplyr)
library(tidyr)
library(stringr)

vec <- c( "N[A-K]{1}[0-9]+",
          "SA[0-9]+",
          "M[A-Z]+[0-9]+",
          "Q[A-Z]+[0-9]+",
          "W[A-Z]+[0-9]+",
          "T[A-Z]+[0-9]+",
          "3[A-Z]+[0-9]+" )
#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )
df <- df %>%
  #use the combined pattern (not vec); str_extract() gives NA instead of character(0)
  mutate(id1 = str_extract(Property, pattern),
         id2 = str_extract(Address, pattern)) %>%
  #na.rm = TRUE keeps the NA values out of the united string
  unite(id3, id1, id2, remove = TRUE, sep = " ", na.rm = TRUE) %>%
  #keep one ID per row; rows without any ID become NA
  mutate(id3 = str_extract(id3, pattern))
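
Since the question also asks about stripping the identifier out of the address text, here is a minimal sketch of that step. It assumes the ID is always introduced by "PIC", optionally followed by ":" or "#" and spaces, as in the sample data. The helper name strip_id and the variable pic_pattern are hypothetical.

```r
library(stringr)

vec <- c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+", "Q[A-Z]+[0-9]+",
         "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+", "3[A-Z]+[0-9]+")
pattern <- paste(vec, collapse = "|")

# "PIC", optional spaces, optional ":" or "#", optional spaces, then the ID
pic_pattern <- paste0("PIC\\s*[:#]?\\s*(?:", pattern, ")")

strip_id <- function(x) {
  # remove the first "PIC ... ID" chunk and tidy up leftover whitespace
  out <- str_squish(str_remove(x, pic_pattern))
  # text that consisted only of the ID becomes NA
  out[!is.na(out) & out == ""] <- NA_character_
  out
}

cleaned <- strip_id(c("GARANGULA PIC: NH630488", "PIC #3WLKM019", "PO BOX 84", NA))
print(cleaned)
```

Applied to both Property and Address (for example inside mutate()), this would leave only the non-ID text in those columns once the ID has been copied elsewhere.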
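Please edit your question by adding the output of dput(yourData[1:30,]).
Are there fixed letters or digits in the IDs? For example, if Q is a valid ID prefix, is it always followed by 3 letters and 4 digits? If so, checking the IDs becomes much easier.
You say digits 1-9, but many rows contain the digit 0. Should those be excluded? Also: is the length of the search string fixed?
@Marius, yes, that is correct, I have updated the question.
@Wimpel, I have updated the question to include 0-9.
I think the question asks to extract the IDs from the address columns (since they should not really be there; this seems to be part of a data cleaning process).
@Wimpel thanks for your help so far. If I currently have my data frame as a tibble, do I have to convert it to a data.table, or can I keep it as a tibble and change the code? Thanks.
The key is to create a pattern and then use the str_extract_all() function. You can do this on a data.table (as in my example), but you can also create the new column in whatever way you normally use.
@Wimpel, I am getting a lot of character(0) output in the new columns. When I try to unite the 3 new columns, they overwrite the original data in Aa reference. Can I combine them in order of priority? Or am I better off removing the character(0) before trying to use unite?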
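
One possible way to combine by priority, sketched under assumptions: dplyr::coalesce() takes the first non-missing value across its arguments, so an existing Aa reference value is kept and the extracted IDs only fill the gaps. Using str_extract() (singular) gives NA rather than character(0) when nothing matches. The small tibble below is made-up illustration data, not the real database.

```r
library(dplyr)
library(stringr)

vec <- c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+", "Q[A-Z]+[0-9]+",
         "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+", "3[A-Z]+[0-9]+")
pattern <- paste(vec, collapse = "|")

# Illustration data (hypothetical rows in the shape of the question's data)
df <- tibble(
  Property       = c("PIC: 3WABG086", "SPOTTSWOOD", "ROSEVALE", "SOME FARM"),
  Address        = c("260 SPRINGHURST ROAD", "PIC QKDR0078", "PIC: QKEV0169", "PO BOX 1"),
  `Aa reference` = c(NA, NA, "QKEV0169", "NA123456")
)

df <- df %>%
  mutate(`Aa reference` = coalesce(
    as.character(`Aa reference`),    # 1st priority: value already present
                                     # (as.character() because the real column is <lgl>)
    str_extract(Property, pattern),  # 2nd priority: ID found in Property
    str_extract(Address, pattern)    # 3rd priority: ID found in Address
  ))

print(df)
```

Because coalesce() never replaces a non-missing value, rows that already have an ID in the right column are left untouched.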
#sample data for the data.table answer above (pipe-separated text read with fread)
DT <- fread('
Property | Address | Aa_reference
PIC: 3WABG086| 260 SPRINGHURST ROAD| NA
PIC: 35PSR217| 1350 RIVER ROAD | NA
PIC# NH244157| 1038 QUONDONG ROAD |NA
PIC: 3GMUF425| 70 DIGBY ROAD| NA
PIC# 3GMUF425| 70 DIGBY ROAD | NA
PIC QTIWW0626 | REMOLEA | NA
PIC#EBWSE235 | BOX 191 | NA
PIC #3WLKM019 | 198 MONTGOMERY ROAD| NA
PIC # 3BWMM021 | 149 ANDERSONS ROAD | NA
PIC: 3WCGN034 | WERRIBEE | NA
GARANGULA PIC: NH630488| PO BOX 84 |NA
GARANGULA PIC: NH630488 | PO BOX 84| NA
PIC: 3GMTL320| 2980 GLENELG HIGHWAY| NA
GREENSLOPES PIC: MJKE0261| 914 WEST KENTISH ROAD| NA
PIC: WFZB3246 | 859 PFEIFFER ROAD| NA
PIC: WFAY3549| 34605 ALBANY HIGHWAY| NA
PIC: 3CEXK044 | 2244 LAVERS HILL ROAD| NA
PIC: QGWW0462 | ELDERFIELD| NA
PIC: 3WCGN034 | WERRIBEE| NA
KAYA DORPER & WHITE DORPER STUD| PIC: WABN0262| NA
SPOTTSWOOD| PIC QKDR0078 | NA
COOMBOONA HOLSTEINS| PIC 3SPSR217 | NA
ROSEVALE | PIC: QKEV0169 | NA
NA| PIC 3EGON009 | NA
NA | PIC WFKPO316 | NA
IVADENE| PIC 3WANP0T1 | NA
NA | PIC ND225813 | NA
HEAVENLY VALLEY FARMS| PIC #NF538645 | NA
C/- CED WISE AB CENTRE| PIC: QCST0158 |NA
GARANGULA| PIC # NH630488 |NA
', sep = "|")