R: identifying partial string matches

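I have a column in a data frame with values like the following:

and so on.

I would like to identify all the values in bold, that is, every value in the column that has any characters after an underscore, together with the same value where it appears without the underscore. I tried using gsub to get a list of the values in a separate data frame, but that still does not solve the problem. Any help would be appreciated.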

Here is a quick-and-dirty solution:

# Prepare some data
dat <- c("1-1", "1-1_2", "1-2", "1-1_3", "1-3", "1-4_1")

# Split each string on the underscore
dat_mask <- sapply(dat, function(x) {
    string_l <- strsplit(x, "_")[[1]]
    # string_l has more than one element if the original string contains "_";
    # the logical flag is coerced to character ("TRUE"/"FALSE") by c()
    return(c(string_l[1], length(string_l) > 1))
})

# Get parts which came from a string with underscore
str_parts <- unique(dat_mask[1, ][as.logical(dat_mask[2, ])])

# Get indexes of selected strings
str_ind <- dat_mask[1, ] %in% str_parts

# Get values from the initial data
dat[str_ind]

# [1] "1-1"   "1-1_2" "1-1_3" "1-4_1"
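
The same selection can probably be written without the sapply loop. The following is a minimal sketch, not part of the original answer; the names has_underscore, prefix, selected, bare_and_suffixed and selected_strict are made up for illustration. It also shows a stricter variant, on the assumption that the question only wants groups whose bare value (without the underscore) is actually present in the data, which would exclude "1-4_1" here.

```r
# Same sample data as in the answer above
dat <- c("1-1", "1-1_2", "1-2", "1-1_3", "1-3", "1-4_1")

# TRUE for strings that contain an underscore
has_underscore <- grepl("_", dat, fixed = TRUE)

# Part of each string before the first underscore
prefix <- sub("_.*", "", dat)

# Same selection as the loop above: every string whose prefix
# also occurs in some string with an underscore
selected <- dat[prefix %in% prefix[has_underscore]]

# Stricter variant (an assumption about the intent): the bare prefix
# must itself be present in the data as a value without an underscore
bare_and_suffixed <- intersect(prefix[has_underscore], dat[!has_underscore])
selected_strict <- dat[prefix %in% bare_and_suffixed]
```

Here selected reproduces the output shown above, while selected_strict keeps only the "1-1" group.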

Here is a data.table solution:
library(data.table)

# Mimic your dataset
dat = data.frame(`Claim Number` = c("1-12835", "1-12835_2", 
"1-12835_3", "1-12835_4", "2", "3", "4", "5", "6-15302", 
"6-15302_2", "7", "8", "9-16186", "9-16186_2"))

# Set the data.frame to data.table
setDT(dat)

# Get the "parent" claim number by removing any characters after the underscore
dat[, parent_claim_number := gsub("_.*", "", Claim.Number)]

# Add an indicator for any parent claim numbers with "sub" claims
dat[, has_sub_claim := any(grepl("_", Claim.Number)), by = .(parent_claim_number)]

The result is:
   Claim.Number parent_claim_number has_sub_claim
 1:      1-12835             1-12835          TRUE
 2:    1-12835_2             1-12835          TRUE
 3:    1-12835_3             1-12835          TRUE
 4:    1-12835_4             1-12835          TRUE
 5:            2                   2         FALSE
 6:            3                   3         FALSE
 7:            4                   4         FALSE
 8:            5                   5         FALSE
 9:      6-15302             6-15302          TRUE
10:    6-15302_2             6-15302          TRUE
11:            7                   7         FALSE
12:            8                   8         FALSE
13:      9-16186             9-16186          TRUE
14:    9-16186_2             9-16186          TRUE

If you want the claims that have sub-claims (parents and sub-claims together), you can do:
dat[has_sub_claim == TRUE]

If you only want the sub-claims without the parent claims, you can do:
dat[has_sub_claim == TRUE & grepl("_", Claim.Number)]
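
The key step above is the group-wise any() over each parent claim number. For readers without data.table, a rough base R equivalent might look like the sketch below. It uses ave() for the grouping; the shortened sample vector and the names claims, parent, is_sub, has_sub, with_sub_claims and only_sub_claims are illustrative assumptions, not part of the original answer.

```r
# Shortened, hypothetical sample in the same format as the data above
claims <- c("1-12835", "1-12835_2", "2", "6-15302", "6-15302_2")

# Parent claim number: drop the underscore and everything after it
parent <- sub("_.*", "", claims)

# TRUE for the sub-claims themselves
is_sub <- grepl("_", claims, fixed = TRUE)

# Group-wise any(): TRUE for every row whose parent group
# contains at least one sub-claim
has_sub <- as.logical(ave(is_sub, parent, FUN = any))

# Parents and sub-claims together
with_sub_claims <- claims[has_sub]

# Sub-claims only
only_sub_claims <- claims[has_sub & is_sub]
```

The two subsets correspond to the two data.table filters above.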

Base R solution:
First remove everything after the underscore so that like strings can be compared:
x <- c('92030534-12835', '92030534-12835_2', '92030534-12835_3', '13212854-14382', '13668582-14232', '93265773-15302', '93265773-15302_2')
df <- data.frame(x)
df$y <- sub('_.*', '', df$x)
df
#                 x              y
#1   92030534-12835 92030534-12835
#2 92030534-12835_2 92030534-12835
#3 92030534-12835_3 92030534-12835
#4   13212854-14382 13212854-14382
#5   13668582-14232 13668582-14232
#6   93265773-15302 93265773-15302
#7 93265773-15302_2 93265773-15302

You can then subset those rows:
df[duplicated(df$y) | duplicated(df$y, fromLast = TRUE), ]

#                 x              y
#1   92030534-12835 92030534-12835
#2 92030534-12835_2 92030534-12835
#3 92030534-12835_3 92030534-12835
#6   93265773-15302 93265773-15302
#7 93265773-15302_2 93265773-15302
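
To see why both duplicated() calls are combined, here is a small sketch on a toy vector (the vector y and the names from_start, from_end and any_repeat are made up for illustration). duplicated() on its own leaves the first occurrence of each repeated value unmarked, and fromLast = TRUE leaves the last one unmarked, so the OR of the two marks every member of a repeated group.

```r
# Toy vector: "a" appears three times, "b" once, "c" twice
y <- c("a", "a", "a", "b", "c", "c")

# Marks repeats scanning from the start, so the first "a" and first "c" are FALSE
from_start <- duplicated(y)

# Marks repeats scanning from the end, so the last "a" and last "c" are FALSE
from_end <- duplicated(y, fromLast = TRUE)

# TRUE for every element whose value occurs more than once
any_repeat <- from_start | from_end
```

Note that this flags any repeated value of y, so an exact repeat of a value with no underscore in x would be marked as well.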

Or add the result as a new column:
df$z <- duplicated(df$y) | duplicated(df$y, fromLast = TRUE)

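Did you read the data into R? Share your data here, and check the output so that it is easier for those trying to help you. Do not add data or code as images. Provide a reproducible example along with the expected output.

Thank you all, I should have known!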