Cleaning comma-separated words in a column in R


I have a data frame that looks like this:

df = 
Number    Words
 1        A@pple11, Mango   , !!!,Banana,...
 2        G###,Clutter image, Focus^& yourself,..
 3        ....
This is a small example that mimics a much larger real data frame. I need to clean it up and produce something like the following:

 df = 
 Number    Words
 1        Apple11,Mango,Banana,...
 2        G,Clutter image, Focus yourself,..
 3        ....
I am using the following approach:

   dt_2 <- df[, .(Tokens = unlist(strsplit(Words, split = ' '))), by = Number]

   dt_2$Tokens <- gsub('([[:punct:]])|\\s+', '_', dt_2$Tokens)

   dt_2[, Words := tm::scan_tokenizer(Tokens) %>%
     tm::removePunctuation()]

   dt_2[, Stems := tm::stemDocument(Words)]

   dt_2[, .N, by = Words]

   CTP_clean <- dt_2[, .(Words = paste(Words, collapse = ' ')), by = Number]

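In the second row, the words separated by a space are no longer treated as a single entity. Any help with the warnings and with the cleaning would be much appreciated.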

I would use a list column in data.table together with strsplit, like this:

# load package
require(data.table)

# create example data
test <- data.table(
  Number = 1:3, 
  Words = c(
    "A@pple11, Mango   , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# split the strings into a list column
test[, Words2 := strsplit(Words, ",")]

# look at the output
# (The elements of the list column are displayed
# comma separated, don't be confused by that.)
test

test$Words2

test$Words2[[1]]

test$Words2[[2]][2]
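
The list column can then be cleaned element by element and collapsed back into one string per row. The following is only a sketch of how that might look, assuming the same example data as above; clean_tokens and the Clean column are illustrative names introduced here, not part of data.table.

```r
library(data.table)

# same example data as above
test <- data.table(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango   , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  )
)

# split the strings on commas into a list column
test[, Words2 := strsplit(Words, ",", fixed = TRUE)]

# hypothetical helper: clean one character vector of tokens
clean_tokens <- function(x) {
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation characters
  x <- trimws(x)                   # remove leading/trailing whitespace
  x[nzchar(x)]                     # drop tokens that are now empty
}

# clean every element of the list column
test[, Words2 := lapply(Words2, clean_tokens)]

# collapse each cleaned vector back into one comma-separated string
test[, Clean := vapply(Words2, paste, character(1), collapse = ",")]

test$Clean
```

Because the split is on commas only, "Clutter image" stays together as one entry. A row that contains nothing but punctuation (row 3 here) ends up as an empty string rather than being dropped.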

Perhaps the following approach works for you:

library(splitstackshape)
cSplit(test, "Words", ",", "long")[
  , Words := gsub("[[:punct:]]", "", Words)][
    Words != "", list(Words = toString(Words)), Number]
#    Number                            Words
# 1:      1           Apple11, Mango, Banana
# 2:      2 G, Clutter image, Focus yourself
If you do not want a space between the words, use:

paste(Words, collapse = ",")
instead of:

toString(Words)
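
To make the difference concrete, here is a small illustration on a plain character vector (the vector name words is just for this example):

```r
words <- c("Apple11", "Mango", "Banana")

# toString() joins with a comma followed by a space
with_space <- toString(words)
with_space
# "Apple11, Mango, Banana"

# paste(..., collapse = ",") joins with a bare comma
without_space <- paste(words, collapse = ",")
without_space
# "Apple11,Mango,Banana"
```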

Of course, you can do without "splitstackshape"; I won't be offended. In that case, you could do the following:

test[, list(Words = unlist(strsplit(Words, ",", TRUE))), Number][
  , Words := gsub("[[:punct:]]|^\\s+|\\s+$", "", Words)][
    Words != "", list(Words = toString(Words)), Number]
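
Note that the filter Words != "" removes rows whose words are all punctuation, so Number 3 disappears from the result. If every row has to be kept, or if the data is a plain data.frame rather than a data.table, something along these lines might work in base R. This is a sketch under those assumptions; clean_words is a hypothetical helper written for this example, not a function from any package.

```r
# example data as a plain data.frame (same strings as above)
df <- data.frame(
  Number = 1:3,
  Words = c(
    "A@pple11, Mango   , !!!,Banana,",
    " G###,Clutter image, Focus^& yourself,..",
    " ...."
  ),
  stringsAsFactors = FALSE
)

# hypothetical helper: clean each comma-separated string in a vector
clean_words <- function(x, sep = ",") {
  vapply(strsplit(x, sep, fixed = TRUE), function(tokens) {
    # remove punctuation, then surrounding whitespace
    tokens <- trimws(gsub("[[:punct:]]", "", tokens))
    # drop empty tokens and join the rest back together
    paste(tokens[nzchar(tokens)], collapse = sep)
  }, character(1))
}

df$Words <- clean_words(df$Words)
df
```

Here the third row is kept with an empty string in Words, which makes it easy to see which rows had no usable words.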