String: remove punctuation but keep emoticons?

string r text

Is it possible to remove all punctuation but keep emoticons such as the following?

:-(

:)

:D

:p

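1. Working pure-regex solution (a.k.a. edit #2)

This task can be done entirely with regular expressions (many thanks to @Mike Samuel).

First, we build a database of emoticons: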

library("stringi")   # provides stri_paste, stri_c, stri_replace_all_regex
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
   c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
##  [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

# escape_regex() is the helper function defined later in this answer
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

By the way, if you only want to remove a selected set of characters, put e.g. [.,] instead of [\\p{P}] above.
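As a hedged sketch of the pure-regex idea in base R: match (emoticon)|punctuation and replace every match with capture group 1, so emoticons survive and all other punctuation disappears. The sample text, the short emots list, and the variable names are illustrative assumptions, and base R's gsub(..., perl = TRUE) is swapped in here for stri_replace_all_regex.

```r
# Illustrative input; not taken from the original data set.
text <- "Hello, world!!! :-) It's (really) great :D ..."

# A small emoticon list (assumption: extend as needed).
emots <- c(":-)", ":-(", ":)", ":(", ":D", ":p")

# Escape the regex metacharacters that occur in these emoticons: ( ) [ ]
escaped <- gsub("([()\\[\\]])", "\\\\\\1", emots, perl = TRUE)

# Alternation: emoticons (captured as group 1) come first, then any punctuation.
pattern <- paste0("(", paste(escaped, collapse = "|"), ")|\\p{P}")

# Replace each match with group 1. For a plain punctuation match, group 1
# did not participate, so the replacement is the empty string.
result <- gsub(pattern, "\\1", text, perl = TRUE)
print(result)
```

Because the emoticon alternative is listed before \\p{P}, the regex engine tries it first at every position, which is what keeps the colon, hyphen, and parenthesis of an emoticon from being removed one by one.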
2. Regex solution hint: my first (non-wise) attempt (a.k.a. the original answer)

My first idea (left here mainly for "historical reasons") was to approach this problem with look-aheads and look-behinds, but as you can see, it is far from perfect.

To remove every : and ; that is not followed by ), (, P, D, X, 8, [, or ], use a negative look-ahead:

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (with noses, e.g. :-), ;-D, etc.).

Now remove the hyphens (negative look-behind and look-ahead).

A helper function that escapes certain special characters so that they can be used in a regular expression:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

Since some punctuation marks occur inside emoticons, we should not remove them:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] & where_punct[i,2] <= where_emots[,2])
})
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

The result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"

That's it.

Here is a less sophisticated and slower approach than @gagolews's solution. It requires you to feed it a dictionary of emoticons. You can create one or use the one from the qdapDictionaries package. The basic approach is to convert the emoticons to text that cannot be mistaken for anything else (I use the prefix "EMOTICONREPLACE" for this), strip the punctuation, and then convert the placeholders back into emoticons.

I added this functionality to qdap version > 2.0.0 as the sub_holder function. Basically, this function uses the same approach but eases the coding burden. The sub_holder function takes a text vector and the items you want to sub out (such as emoticons). It returns a list containing the text vector with placeholders substituted for those items, and a function (called unhold) that swaps the placeholders back for the original terms. The code is as follows:

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

Using rex may make this kind of task simpler. It automatically escapes characters as needed, and it joins all elements of a vector as alternatives when the vector is put in an or() function. Using re_matches() with the global argument gives you a list of all emoticons found in each line.

x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce:-(",
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?",
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. 
Kudos :)", "RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D",
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...",
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L,
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54",
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -6L))

emots <- as.character(outer(c(":", ";", ":-", ";-"),
   c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

library(rex)
# the closing parenthesis of rex() was missing in the flattened listing
re_matches(x$text,
  rex(
    capture(name = 'emoticon',
      or(emots)
    )
  ),
  global = TRUE)
#>[[1]]
#>  emoticon
#>1       :D
#>2       :D
#>
#>[[2]]
#>  emoticon
#>1     <NA>
#>
#>[[3]]
#>  emoticon
#>1      :-(
#>
#>[[4]]
#>  emoticon
#>1     <NA>
#>
#>[[5]]
#>  emoticon
#>1       :p
#>
#>[[6]]
#>  emoticon
#>1       :)
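The placeholder idea described above (hold the emoticons, strip punctuation, then restore them) can be sketched in base R without qdap. This is only an illustration under stated assumptions: the sample text and variable names are made up, and the regex-based stripping step stands in for qdap's strip, whose other behaviour it does not reproduce.

```r
# Illustrative input and emoticon list (assumptions, not the original data).
text <- "Is it working? Kudos :) really... X-( no!"
emos <- c(":-(", ":)", ":D", ":p", "X-(")

# One purely alphanumeric placeholder per emoticon.
holders <- paste0("EMOTICONREPLACE", seq_along(emos))

# 1. Swap each emoticon for its placeholder.
#    fixed = TRUE treats the emoticon literally, so no regex escaping is needed.
held <- text
for (i in seq_along(emos)) {
  held <- gsub(emos[i], holders[i], held, fixed = TRUE)
}

# 2. Strip punctuation, but keep "!" and "?" via a negative look-ahead.
stripped <- gsub("(?![!?])\\p{P}", "", held, perl = TRUE)

# 3. Swap the placeholders back, highest index first, so that a short
#    placeholder (EMOTICONREPLACE1) cannot clobber the start of a longer
#    one (EMOTICONREPLACE10) in larger dictionaries.
result <- stripped
for (i in rev(seq_along(emos))) {
  result <- gsub(holders[i], emos[i], result, fixed = TRUE)
}
print(result)
```

Using fixed = TRUE for the swaps also avoids the "invalid regular expression" problem that unescaped parentheses in emoticons cause when they are passed to gsub as regex patterns.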
First mark the emoticons (gsub them with tokens like $SMILEY1), then remove the punctuation, then replace the tokens with the emoticons again.

For some emoticons, e.g. =(, I get the error message "Error in gsub("=(", "emoticons", Data_edited_txt$text): invalid regular expression '=(', reason 'Missing ')''".

@user3456230 That is because ( should be escaped in a regex; see the escape_regex function in my answer.

@Richard, they are very important. They represent a person's attempt to re-inject the embodied expression of emotion, gesture, and body language into a virtual space. They carry a great deal of information. If anyone here teaches a discussion course online, take note: I'd want you to stop, because you are placing great constraints on the conversation and ensuring that no real conversation takes place. It is like telling students they may not use facial expressions, eye contact, body movement, or gestures in class.

@RichardScriven Sorry, no disrespect meant. This shows how hard it is to read a person's tone in written language. You mistook my excited agitation for offense :-)

This is a nice solution, but what about the :-( in the third string...? I think he wants to keep it.

That is why it is there.

@TylerRinker, thanks for the great solution! In this case, is it possible to keep the exclamation mark (!) and the question mark (?) as well? I need the data set for sentiment analysis.

This is a very nice, creative solution on the OP's original data set.

Why not match (emoticon)|punctuation (note that the emoticon is in a capture group) and then replace with the result of capture group 1, which only matches emoticons? Because | is greedy and left-biased, emoticons will be matched early and often.

Great! Could you post an answer with this idea? I have some R code that uses this regex feature and works as expected. I don't want to take the credit for it :)

I know very little R, so I don't think I would add anything by actually writing the code, and this answer has already contributed a lot by defining emots in a maintainable way.

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] & where_punct[i,2] <= where_emots[,2])
})
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1] 58 41 32 59 80 32 58 93 32 58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!
stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"

library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)
reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]),
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))
truncdf(left_just(dat[, 3, drop=F]), 50)
##   Temp
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service
## 3 ãããæããããéãããæãããInappropriate announce:-(
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o

# variant that keeps "!" and "?"
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]),
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
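The position-based removal used in the stringi listing (locate the emoticons, locate the punctuation, then drop only the punctuation that lies outside every emoticon span) can also be sketched in base R. This is a minimal illustration under stated assumptions: the sample text and the two-emoticon list are made up, and gregexpr stands in for stri_locate_all_regex.

```r
# Illustrative input and emoticon list (assumptions, not the original data).
text <- "Oh boy! :-) I've been to... the (grocery) store :P"
emots <- c(":-)", ":P")

# Escape ( ) [ ] so the emoticons can be used inside a regex.
escaped <- gsub("([()\\[\\]])", "\\\\\\1", emots, perl = TRUE)
emot_pattern <- paste(escaped, collapse = "|")

# Start and end positions of every emoticon in the string.
m <- gregexpr(emot_pattern, text, perl = TRUE)[[1]]
emot_start <- as.integer(m)
emot_end <- emot_start + attr(m, "match.length") - 1L

# Positions of all punctuation characters (gregexpr returns -1 for "no match").
punct_pos <- as.integer(gregexpr("\\p{P}", text, perl = TRUE)[[1]])
punct_pos <- punct_pos[punct_pos > 0]

# Keep punctuation that falls inside an emoticon span; drop the rest.
inside <- vapply(punct_pos,
                 function(p) any(p >= emot_start & p <= emot_end),
                 logical(1))
drop <- punct_pos[!inside]

chars <- strsplit(text, "")[[1]]
if (length(drop) > 0) chars <- chars[-drop]  # guard: negative indexing with an empty vector would drop everything
result <- paste(chars, collapse = "")
print(result)
```

The explicit length check plays the same role as the "be sure that where_punct is not empty" warning in the listing.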