Cleaning comma-separated words that appear in a column in R
I have a data frame like the one shown below:
df =
Number Words
1 A@pple11, Mango , !!!,Banana,...
2 G###,Clutter image, Focus^& yourself,..
3 ....
This is a small example that mimics a much larger real data frame. I need to clean it up and produce something like this:
df =
Number Words
1 Apple11,Mango,Banana,...
2 G,Clutter image, Focus yourself,..
3 ....
I am using the following approach:
dt_2 <- df[, .(Tokens = unlist(strsplit(Words, split = ' '))), by = Number]
dt_2$Tokens = gsub('([[:punct:]])|\\s+','_',dt_2$Tokens)
dt_2[, Words := tm::scan_tokenizer(Tokens) %>%
tm::removePunctuation()
]
dt_2[, Stems := tm::stemDocument(Words)]
dt_2[, .N, by = Words]
CTP_clean <- dt_2[, .(Words = paste(Words, collapse = ' ')), by =
Number]
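The second row contains space-separated words, and they are no longer treated as a single entity. Any help with the warnings and the cleaning would be great.
I would use a list column in a data.table together with strsplit, as follows: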
# load package
require(data.table)
# create example data
test <- data.table(
Number = 1:3,
Words = c(
"A@pple11, Mango , !!!,Banana,",
" G###,Clutter image, Focus^& yourself,..",
" ...."
)
)
# split the strings into a list column
test[, Words2 := strsplit(Words, ",")]
# look at the output
# (The elements of the list column are displayed
# comma separated; don't be confused by that.)
test
test$Words2
test$Words2[[1]]
test$Words2[[2]][2]
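The list column above only splits the strings; it does not clean them yet. A minimal sketch of how the cleaning could be done per list element follows. The helper name clean_tokens and the column name Clean are made up for this illustration, and the code assumes that stripping every punctuation character is what you want.

```r
library(data.table)

# same example data as above
test <- data.table(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# hypothetical helper: clean one character vector of tokens
clean_tokens <- function(tokens) {
  tokens <- gsub("[[:punct:]]", "", tokens)  # remove punctuation characters
  tokens <- trimws(tokens)                   # remove leading/trailing whitespace
  tokens[nzchar(tokens)]                     # drop tokens that are now empty
}

# split on commas, then clean every element of the resulting list;
# the outer list() makes sure the result is stored as one list column
test[, Words2 := list(lapply(strsplit(Words, ",", fixed = TRUE), clean_tokens))]

# collapse each cleaned token vector back into one comma-separated string
test[, Clean := vapply(Words2, paste, character(1), collapse = ",")]
test$Clean
```

Rows whose tokens all vanish (row 3 here) end up with an empty string in Clean, so no row is lost. Multi-word tokens such as "Clutter image" stay together because the split only happens on commas.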
Perhaps the following will work for you:
library(splitstackshape)
cSplit(test, "Words", ",", "long")[
, Words := gsub("[[:punct:]]", "", Words)][
Words != "", list(Words = toString(Words)), Number]
# Number Words
# 1: 1 Apple11, Mango, Banana
# 2: 2 G, Clutter image, Focus yourself
If you don't want spaces between the words, use:
paste(Words, collapse = ",")
instead of:
toString(Words)
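To make the difference concrete, here is a tiny sketch on a plain character vector (the vector words is just an illustration, not part of the data above):

```r
words <- c("Apple11", "Mango", "Banana")

# toString() joins the elements with a comma followed by a space
with_space <- toString(words)

# paste() with collapse = "," joins them with a bare comma
no_space <- paste(words, collapse = ",")

print(with_space)
print(no_space)
```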
Of course, you can do without "splitstackshape"; I won't be offended. In that case, you could do the following:
test[, list(Words = unlist(strsplit(Words, ",", TRUE))), Number][
, Words := gsub("[[:punct:]]|^\\s+|\\s+$", "", Words)][
Words != "", list(Words = toString(Words)), Number]
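Note that the filter on Words != "" drops any Number whose tokens are all empty after cleaning (row 3 in the example). If you need to keep every Number, one possible sketch is to join the cleaned result back onto the original keys. The names cleaned and result are made up here, and it is an assumption that an NA is an acceptable placeholder for such rows.

```r
library(data.table)

# same example data as above
test <- data.table(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# split, clean and collapse exactly as above (using paste for no spaces)
cleaned <- test[, list(Words = unlist(strsplit(Words, ",", TRUE))), Number][
  , Words := gsub("[[:punct:]]|^\\s+|\\s+$", "", Words)][
  Words != "", list(Words = paste(Words, collapse = ",")), Number]

# join back onto all original Numbers; rows without any
# surviving word get NA in the Words column
result <- cleaned[test[, .(Number)], on = "Number"]
result
```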