R: Is there a way to strip out observations with a known format?


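I am working with a database that contains more than 40 variables. Each case has a unique identifier for its property. Some of these identifiers have been entered into the address variable.

The identifiers can only take the following formats: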

NA123456 - First letter constant - N, 1 Letter A-K, 6 Numbers 0-9
SA123456 - First 2 letters constant - SA, 6 Numbers 0-9
MABC1234 - First letter constant - M, 3 Letters A-Z, 4 Numbers 0-9
QABC1234 - First letter constant - Q, 3 Letters A-Z, 4 Numbers 0-9
WABC1234 - First letter constant - W, 3 Letters A-Z, 4 Numbers 0-9
TABC1234 - First letter constant - T, 3 Letters A-Z, 4 Numbers 0-9
3ABCD123 - First number constant - 3, 4 Letters A-Z, 3 Numbers 0-9
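I am not sure how to strip the unique identifiers out of the address text without creating a lookup table and using left_join. A lookup table would need constant updating, which makes it very cumbersome.

I have not been able to find an example of this, though I may have missed something.

My data looks like this: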

Property                        Address               `Aa reference`
   <chr>                           <chr>                 <lgl>         
 1 PIC: 3WABG086                   260 SPRINGHURST ROAD  NA            
 2 PIC: 35PSR217                   1350 RIVER ROAD       NA            
 3 PIC# NH244157                   1038 QUONDONG ROAD    NA            
 4 PIC: 3GMUF425                   70 DIGBY ROAD         NA            
 5 PIC# 3GMUF425                   70 DIGBY ROAD         NA            
 6 PIC QTIWW0626                   REMOLEA               NA            
 7 PIC#EBWSE235                    BOX 191               NA            
 8 PIC #3WLKM019                   198 MONTGOMERY ROAD   NA            
 9 PIC # 3BWMM021                  149 ANDERSONS ROAD    NA            
10 PIC: 3WCGN034                   WERRIBEE              NA            
11 GARANGULA PIC: NH630488         PO BOX 84             NA            
12 GARANGULA PIC: NH630488         PO BOX 84             NA            
13 PIC: 3GMTL320                   2980 GLENELG HIGHWAY  NA            
14 GREENSLOPES PIC: MJKE0261       914 WEST KENTISH ROAD NA            
15 PIC: WFZB3246                   859 PFEIFFER ROAD     NA            
16 PIC: WFAY3549                   34605 ALBANY HIGHWAY  NA            
17 PIC: 3CEXK044                   2244 LAVERS HILL ROAD NA            
18 PIC: QGWW0462                   ELDERFIELD            NA            
19 PIC: 3WCGN034                   WERRIBEE              NA            
20 KAYA DORPER & WHITE DORPER STUD PIC: WABN0262         NA            
21 SPOTTSWOOD                      PIC QKDR0078          NA            
22 COOMBOONA HOLSTEINS             PIC 3SPSR217          NA            
23 ROSEVALE                        PIC: QKEV0169         NA            
24 NA                              PIC 3EGON009          NA            
25 NA                              PIC WFKPO316          NA            
26 IVADENE                         PIC 3WANP0T1          NA            
27 NA                              PIC ND225813          NA            
28 HEAVENLY VALLEY FARMS           PIC #NF538645         NA            
29 C/- CED WISE AB CENTRE          PIC: QCST0158         NA            
30 GARANGULA                       PIC # NH630488        NA
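The clean data would end up with the unique identifier in the `Aa reference` column, and observations where the data is already in the correct variable would not be overwritten with NA.

Any help is much appreciated.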

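One possible answer, using a regex pattern and stringr::str_extract_all().

I assumed your digits should be 0-9 rather than 1-9. If not, change every [0-9] to [1-9].

Also, if you are looking for a specific number (say n) of repeated letters or digits, change + to {n}, as in the first pattern in vec.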
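As a rough sketch of the {n} idea, the patterns could be tightened to fixed lengths. This assumes every identifier is exactly 8 characters long and that digits run 0-9; the names strict_vec and strict_pattern are made up for this illustration and are not part of the original answer. The \\b word boundaries are an extra assumption, added so that an overlong token is not partially matched.

```r
library(stringr)

# Fixed-length variants of the patterns (assumption: IDs are exactly
# 8 characters and digits are 0-9).
strict_vec <- c("N[A-K][0-9]{6}",          # NA123456
                "SA[0-9]{6}",              # SA123456
                "[MQWT][A-Z]{3}[0-9]{4}",  # MABC1234, QABC1234, WABC1234, TABC1234
                "3[A-Z]{4}[0-9]{3}")       # 3ABCD123

# Wrap the alternatives in word boundaries so that a 9-character token
# such as "QTIWW0626" does not yield a match on its tail.
strict_pattern <- paste0("\\b(?:", paste(strict_vec, collapse = "|"), ")\\b")

x <- c("PIC: 3WABG086", "PIC# NH244157", "PIC QTIWW0626", "260 SPRINGHURST ROAD")
str_extract(x, strict_pattern)
# "3WABG086" "NH244157" NA NA
```

With the looser + quantifiers, "QTIWW0626" would be returned as a match even though it has one letter too many for the Q format, so which variant is preferable depends on whether malformed IDs should be kept for review or dropped.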

library( data.table )
library( stringr )

# NA123456 - First letter constant - N, Letter A-K, Numbers 1-9
# SA123456 - First 2 letters constant - SA, Numbers 1-9
# MABC1234 - First letter constant - M, Letters A-Z, Numbers 1-9
# QABC1234 - First letter constant - Q, Letters A-Z, Numbers 1-9
# WABC1234 - First letter constant - W, Letters A-Z, Numbers 1-9
# TABC1234 - First letter constant - T, Letters A-Z, Numbers 1-9
# 3ABCD123 - First number constant - 3, Letters A-Z, Numbers 1-9

#create a vector with all regex-patterns
#I assumed 1-9 should be 0-9 ??             <-- !!
vec <- c( "N[A-K]{1}[0-9]+", 
          "SA[0-9]+",
          "M[A-Z]+[0-9]+",
          "Q[A-Z]+[0-9]+",
          "W[A-Z]+[0-9]+",
          "T[A-Z]+[0-9]+",
          "3[A-Z]+[0-9]+" )
#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )
#extract all matches from the column 'Address' and store them (as a list column) in Aa_reference
#DT is the data.table that holds the data
DT[, Aa_reference := stringr::str_extract_all( Address, pattern )]
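A small illustration of what the two stringr extractors return may help here, since str_extract_all() produces a list (with character(0) for rows without a match) while str_extract() produces a plain character vector with NA. The address vector below is a made-up sample taken from the question's data, and using str_extract() assumes there is at most one identifier per cell.

```r
library(stringr)

# Same patterns as in vec above, collapsed into one OR-pattern.
pattern <- paste(c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+",
                   "Q[A-Z]+[0-9]+", "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+",
                   "3[A-Z]+[0-9]+"), collapse = "|")

address <- c("PIC: WABN0262", "PO BOX 84", "PIC #NF538645")

# List output: one element per row, character(0) where nothing matches.
all_matches <- str_extract_all(address, pattern)

# Character output: first match per row, NA where nothing matches.
first_match <- str_extract(address, pattern)
first_match
# "WABN0262" NA "NF538645"
```

Because str_extract() works on any character vector, the same call can be used inside a tibble or a plain data frame; a data.table is not required.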

This is what worked in the end:

library(dplyr)
library(tidyr)
library(stringr)

vec <- c( "N[A-K]{1}[0-9]+", 
          "SA[0-9]+",
          "M[A-Z]+[0-9]+",
          "Q[A-Z]+[0-9]+",
          "W[A-Z]+[0-9]+",
          "T[A-Z]+[0-9]+",
          "3[A-Z]+[0-9]+" )

#paste patterns together to one large regex-OR-pattern
pattern <- paste( vec, collapse = "|" )

#str_extract() returns NA (not character(0)) when nothing matches,
#so no na_if() clean-up of list columns is needed
df <- df %>%
  mutate(id1 = str_extract(Property, pattern),
         id2 = str_extract(Address, pattern)) %>% 
  #combine both columns, dropping NAs so they do not end up in the text
  unite(id3, id1, id2, remove = TRUE, sep = " ", na.rm = TRUE) %>% 
  #keep the first ID; rows without any ID become NA
  mutate(id3 = str_extract(id3, pattern))
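To avoid overwriting values that are already in the right column, one option is dplyr::coalesce(), which keeps the first non-missing value per row. The following is only a sketch on a small made-up tibble; the rows are adapted from the question, and the pre-filled "QKEV0169" reference is invented for the illustration.

```r
library(dplyr)
library(stringr)

pattern <- paste(c("N[A-K]{1}[0-9]+", "SA[0-9]+", "M[A-Z]+[0-9]+",
                   "Q[A-Z]+[0-9]+", "W[A-Z]+[0-9]+", "T[A-Z]+[0-9]+",
                   "3[A-Z]+[0-9]+"), collapse = "|")

# Hypothetical sample: row 3 already has a reference that must survive.
df <- tibble(
  Property       = c("PIC: 3WABG086", "SPOTTSWOOD", "ROSEVALE", "ELDERFIELD"),
  Address        = c("260 SPRINGHURST ROAD", "PIC QKDR0078", "12 HIGH STREET", "PO BOX 84"),
  `Aa reference` = c(NA, NA, "QKEV0169", NA)
)

cleaned <- df %>%
  mutate(`Aa reference` = coalesce(
    as.character(`Aa reference`),    # existing value wins if present
    str_extract(Property, pattern),  # otherwise an ID found in Property
    str_extract(Address, pattern)    # otherwise an ID found in Address
  ))

cleaned$`Aa reference`
# "3WABG086" "QKDR0078" "QKEV0169" NA
```

The as.character() call is there because the column in the question was read as logical (all NA); converting it first keeps all three inputs the same type.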

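Please edit your question by adding the output of dput(yourData[1:30,]).

Are there fixed letters or digits in the IDs? For example, if Q is a valid ID, is it always followed by 3 letters and 4 digits? If so, checking the IDs becomes much easier.

You say digits 1-9, but many rows contain the number 0. Should those be excluded? Also: does the search string have a fixed length?

@Marius, yes, that is correct, I have updated the question.

@Wimpel, I have updated the question to include 0-9.

I think the question asks to extract the IDs from the address column (since they should not really be there, this seems to be part of cleaning the data).

@Wimpel thanks for your help so far. If I currently have my data frame as a tibble, do I have to convert it to a data.table, or can I keep it as a tibble and change the code? Thanks.

The key is to create a pattern and then use the str_extract_all() function. You can do this on a data.table (as in my example), but you can also create the new column in whichever way you normally would.

@Wimpel, I am getting a lot of character(0) output in the new columns. When I try to combine the 3 new columns together, they overwrite the original data in Aa reference. Can I do the combine by rank? Or would I be better off removing the character(0) values before attempting to use unite?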
DT <- fread('
Property |                       Address |              Aa_reference
PIC: 3WABG086|                   260 SPRINGHURST ROAD|  NA            
PIC: 35PSR217|                   1350 RIVER ROAD      | NA            
PIC# NH244157|                   1038 QUONDONG ROAD    |NA            
PIC: 3GMUF425|                   70 DIGBY ROAD|         NA            
PIC# 3GMUF425|                   70 DIGBY ROAD |        NA            
PIC QTIWW0626 |                  REMOLEA        |       NA            
PIC#EBWSE235   |                 BOX 191         |      NA            
PIC #3WLKM019   |                198 MONTGOMERY ROAD|   NA            
PIC # 3BWMM021   |               149 ANDERSONS ROAD  |  NA            
PIC: 3WCGN034     |              WERRIBEE             | NA            
GARANGULA PIC: NH630488|         PO BOX 84             |NA            
GARANGULA PIC: NH630488 |        PO BOX 84|             NA            
PIC: 3GMTL320|                   2980 GLENELG HIGHWAY|  NA            
GREENSLOPES PIC: MJKE0261|       914 WEST KENTISH ROAD| NA            
PIC: WFZB3246           |        859 PFEIFFER ROAD|     NA            
PIC: WFAY3549|                   34605 ALBANY HIGHWAY|  NA            
PIC: 3CEXK044 |                  2244 LAVERS HILL ROAD| NA            
PIC: QGWW0462  |                 ELDERFIELD|            NA            
PIC: 3WCGN034   |                WERRIBEE|              NA            
KAYA DORPER & WHITE DORPER STUD| PIC: WABN0262|         NA            
SPOTTSWOOD|                      PIC QKDR0078  |        NA            
COOMBOONA HOLSTEINS|             PIC 3SPSR217   |       NA            
ROSEVALE            |            PIC: QKEV0169   |      NA            
NA|                              PIC 3EGON009     |     NA            
NA |                             PIC WFKPO316      |    NA            
IVADENE|                         PIC 3WANP0T1       |   NA            
NA      |                        PIC ND225813        |  NA            
HEAVENLY VALLEY FARMS|           PIC #NF538645        | NA            
C/- CED WISE AB CENTRE|          PIC: QCST0158         |NA            
GARANGULA|                       PIC # NH630488        |NA
', sep = "|")