Filter only specific strings from a column in R
I have a dataset in which one column holds comma-separated values. I need to parse each value in this column, keep only specific values, and drop the rest. My code and data are below:
myDf <- structure(list(GeogPreferences = structure(1:4, .Label = c("Central and East Europe, Europe, North America, West Europe, US",
"Europe, North America, West Europe, US", "Global, North America",
"Northeast, Southeast, West, US"), class = "factor")), .Names = "GeogPreferences", class = "data.frame", row.names = c(NA,
-4L))
regionInterest <- c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")
k<-lapply(as.character(myDf$GeogPreferences),function(x) {
z<-trimws(unlist(strsplit(x,split = ",")))
z <- ifelse((z %in% regionInterest), z[z %in% regionInterest], z)
})
myDf$GeogPreferences<-unlist(k)
If a string in the column matches anything in regionInterest, I want to keep that string; otherwise I want to remove it.
My expected result is:
GeogPreferences
1 North America, US
2 North America, US
3 North America
4 Northeast, Southeast, West, US
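Can someone help me see what I'm doing wrong? Thanks
The error you get occurs because strsplit produces more elements than your input df has rows, so unlist(k) cannot be assigned back to the column. Also, in the ifelse statement you return z where the test is FALSE, so it doesn't do what you intend.
Here is a tidyr/dplyr solution: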
library(dplyr)
library(tidyr)
myDf %>%
mutate(id = row_number()) %>%
separate_rows(GeogPreferences, sep = ",") %>%
mutate(GeogPreferences = trimws(GeogPreferences)) %>%
filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>%
group_by(id) %>%
summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>%
select(-id)
# A tibble: 4 × 1
GeogPreferences
<chr>
1 North America, US
2 North America, US
3 North America
4 Northeast, Southeast, West, US
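To make the two problems with the original code concrete, here is a minimal base R sketch. It rebuilds the question's data as a plain character vector (the names prefs, pieces, bad and good are only illustrative) and shows that the split yields more pieces than there are rows, and that ifelse keeps the unwanted values.

```r
regionInterest <- c("Americas", "North America", "US", "Northeast",
                    "Southeast", "West", "Midwest", "Southwest")
prefs <- c("Central and East Europe, Europe, North America, West Europe, US",
           "Europe, North America, West Europe, US",
           "Global, North America",
           "Northeast, Southeast, West, US")

# Problem 1: splitting gives one vector per row, and unlist() would flatten
# them into far more elements than the data frame has rows.
pieces <- lapply(strsplit(prefs, ",", fixed = TRUE), trimws)
n_pieces <- sum(lengths(pieces))   # total pieces across the 4 rows

# Problem 2: ifelse() falls back to z wherever the test is FALSE,
# so values outside regionInterest are kept rather than dropped.
z    <- pieces[[3]]                # "Global" "North America"
keep <- z %in% regionInterest
bad  <- ifelse(keep, z[keep], z)   # still contains "Global"
good <- z[keep]                    # plain logical subsetting drops it
```

Plain logical subsetting followed by collapsing each row back into one string is what the answers here rely on.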
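You should probably split the data first and only then run the subsetting.
This improves efficiency because strsplit is vectorized, and the size of the vector in each split doesn't matter. Also, trimws isn't needed; it only makes the code less efficient. Instead, split on ", " while specifying fixed = TRUE. This makes strsplit roughly 10 times faster because it won't use regular expressions to split.
The following uses base R only: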
do.call(rbind, # you can use `rbind.data.frame` instead if you don't want a matrix
lapply(strsplit(as.character(myDf$GeogPreferences), ", ", fixed = TRUE),
function(x) toString(x[x %in% regionInterest])))
# [,1]
# [1,] "North America, US"
# [2,] "North America, US"
# [3,] "North America"
# [4,] "Northeast, Southeast, West, US"
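If the result is meant to go back into the data frame, a plain character vector is easier to assign than a one-column matrix. The following is a sketch of that variant, assuming the same data as in the question; swapping do.call(rbind, ...) for vapply is my substitution, not part of the answer above.

```r
myDf <- data.frame(
  GeogPreferences = c("Central and East Europe, Europe, North America, West Europe, US",
                      "Europe, North America, West Europe, US",
                      "Global, North America",
                      "Northeast, Southeast, West, US"),
  stringsAsFactors = TRUE
)
regionInterest <- c("Americas", "North America", "US", "Northeast",
                    "Southeast", "West", "Midwest", "Southwest")

# Split every row at once; fixed = TRUE avoids the regex engine
parts <- strsplit(as.character(myDf$GeogPreferences), ", ", fixed = TRUE)

# Keep only the regions of interest and collapse each row to one string;
# vapply guarantees exactly one character value per row
myDf$GeogPreferences <- vapply(
  parts,
  function(p) toString(p[p %in% regionInterest]),
  character(1)
)
```

A row with no matching region ends up as an empty string rather than being dropped, so the number of rows stays the same.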
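Although the solution above is similar to your own, it is still a row-wise solution. Instead, we could try to achieve the same result by operating column-wise. By column-wise I mean that if we use a transposed split, the number of iterations will be the number of pieces in the longest entry of myDf$GeogPreferences. The number of commas we split on should be far smaller than the number of rows in the data.
Here's an illustration using data.table::tstrsplit (it appears as DA2 in the benchmark further down)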
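The snippet for this step did not survive on this page, so the following is a sketch reconstructed from the DA2 function in the benchmark, run on the question's data as a plain character vector (prefs and cols are illustrative names). It assumes the data.table package is installed.

```r
library(data.table)

regionInterest <- c("Americas", "North America", "US", "Northeast",
                    "Southeast", "West", "Midwest", "Southwest")
prefs <- c("Central and East Europe, Europe, North America, West Europe, US",
           "Europe, North America, West Europe, US",
           "Global, North America",
           "Northeast, Southeast, West, US")

# Transposed split: one list element per comma position (5 here),
# each of length nrow, padded with NA where a row has fewer pieces
cols <- tstrsplit(prefs, ", ", fixed = TRUE)

# Column by column, blank out everything that is not a region of interest
cols <- lapply(cols, function(v) replace(v, !v %in% regionInterest, NA_character_))

# Paste the columns back together row-wise, then strip the "NA" placeholders
res <- do.call(paste, c(cols, sep = ", "))
res <- gsub("NA, |, NA", "", res)
```

The loop runs over 5 columns instead of over every row, which is where the speed-up comes from. One caveat of the gsub clean-up: a row with no matching region at all would be left as the literal string "NA".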
Below is a simple benchmark on a 100K-row dataset:
bigDF <- myDf[sample(nrow(myDf), 1e5, replace = TRUE),, drop = FALSE]
library(dplyr)
library(tidyr)
library(data.table)
tidyverse <- function(x) {
x %>%
mutate(id = row_number()) %>%
separate_rows(GeogPreferences, sep = ",") %>%
mutate(GeogPreferences = trimws(GeogPreferences)) %>%
filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>%
group_by(id) %>%
summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>%
select(-id)
}
MF <- function(x) {
k <- lapply(as.character(x$GeogPreferences), function(x) {
z <- trimws(unlist(strsplit(x, split = ",")))
z <- z[z %in% regionInterest]
})
sapply(k, paste, collapse = ", ")
}
DA1 <- function(x) {
do.call(rbind,
lapply(strsplit(as.character(x$GeogPreferences), ", ", fixed = TRUE),
function(x) toString(x[x %in% regionInterest])))
}
DA2 <- function(x) {
tmp <- data.table::tstrsplit(x$GeogPreferences, ", ", fixed = TRUE)
res <- do.call(paste,
c(sep = ", ",
lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
gsub("NA, |, NA", "", res)
}
system.time(tidyverse(bigDF))
# user system elapsed
# 17.67 0.01 17.91
system.time(MF(bigDF))
# user system elapsed
# 15.52 0.00 15.70
system.time(DA1(bigDF))
# user system elapsed
# 0.97 0.00 1.00
system.time(DA2(bigDF))
# user system elapsed
# 0.25 0.00 0.25
So the other two solutions took more than 15 seconds to run, while both of mine finished in about a second or less (1.00 s and 0.25 s elapsed).
If you prefer a solution that stays closer to your own approach, change it to:
regionInterest <- c("Americas", "North America", "US",
"Northeast","Southeast","West","Midwest","Southwest")
k<-lapply(as.character(myDf$GeogPreferences),function(x) {
z<-trimws(unlist(strsplit(x,split = ",")))
# this makes sure you only use z which are in regionInterest
z <- z[z %in% regionInterest]
})
# paste with collapse turns a vector of strings into one value, separated by the collapse argument
myDf$GeogPreferences<-sapply(k, paste, collapse = ", ")
I hope this helps.
Thanks Jack!! This is helpful.
Thanks Mark!! This is helpful.
Thanks David!! This is helpful. That's a great comparison.
Thanks, David. What do I need to do if I want to apply the change to the existing column and save it back to the dataset? Even after using rbind.data.frame I get this error: Error: Each variable must be a 1d atomic vector or list. Problem variables: 'Geog.Preferences'
Just reassign it.
I checked, and it works fine when I reassign to myDf$GeogPreferences. My code had a problem on another line. Thanks for your help.