Remove punctuation but keep emoticons?
Tags: string, r, text, gsub, emoticons
Is it possible to remove all punctuation but keep the following emoticons?
:-( :) :D :p
1. Working pure regex solution (a.k.a. edit #2)
This task can be done entirely with regular expressions (many thanks to @Mike Samuel).
First, we build a database of emoticons:
library(stringi)
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
   c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)" ";)" ":-)" ";-)" ":(" ";(" ":-(" ";-(" ":]" ";]" ":-]" ";-]" ":[" ";[" ":-[" ";-[" ":D" ";D" ":-D" ";-D"
## [21] ":o" ";o" ":-o" ";-o" ":O" ";O" ":-O" ";-O" ":P" ";P" ":-P" ";-P" ":p" ";p" ":-p" ";-p"
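# Build a regex that matches any of the emoticons.
# escape_regex() is the helper function defined further below.
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))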
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
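The emoticon regex becomes a complete solution when combined with the idea Mike Samuel describes in the comments: match `(emoticons)|punctuation` and replace every match with capture group 1. The following is a minimal sketch of that idea, not the answer's own code. It assumes base R's `gsub()` with `perl = TRUE` instead of stringi, uses `[[:punct:]]` as the punctuation class, and the sample `text` is made up for illustration.

```r
# Emoticons to keep, built as in the answer but with base R's paste0().
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"),
                            paste0))

# Escape the regex metacharacters that occur in the emoticons: ( ) [ ]
escape_regex <- function(r) gsub("([()\\[\\]])", "\\\\\\1", r, perl = TRUE)

# Group 1 matches any emoticon; the second alternative matches any punctuation.
pattern <- paste0("(", paste(escape_regex(emots), collapse = "|"), ")|[[:punct:]]")

text <- ":) I've been to... the (grocery) store :P :-) Oh boy!"  # illustrative input

# An emoticon match is replaced by itself (group 1); a lone punctuation
# character leaves group 1 empty and is therefore deleted.
result <- gsub(pattern, "\\1", text, perl = TRUE)
print(result)  # ":) Ive been to the grocery store :P :-) Oh boy"
```

Because the emoticon alternative comes first, an emoticon is consumed as a whole before its individual characters can be matched as punctuation.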
By the way, if you only want to remove a selected set of characters, put e.g. [,] in place of [\\p{P}] in the pattern above.
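As a small illustration of restricting the removal to a chosen set of characters, the sketch below deletes only full stops and commas. It assumes base R's `gsub()` with `perl = TRUE`; the two-emoticon pattern and the sample string are hypothetical stand-ins for the full regex.

```r
# Hypothetical, deliberately tiny emoticon pattern: ":)" or ":-(".
emoticons <- "(:\\)|:-\\()"

# Only "." and "," are candidates for removal; "?" and "!" are left alone.
pattern <- paste0(emoticons, "|[.,]")

text <- ":) Hello, world... really?! :-("
result <- gsub(pattern, "\\1", text, perl = TRUE)
print(result)  # ":) Hello world really?! :-("
```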
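2. Regex solution hint: my first (unwise) attempt (a.k.a. original answer)
My first idea (left here mainly for "historical reasons") was to approach this problem with look-aheads and look-behinds, but as you can see, that is far from perfect.
To remove every : and ; that is not followed by ), P, (, D, X, 8, [ or ], use a negative look-ahead: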
stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"
Now we can add some old-school emoticons (with noses, e.g. :-), ;-D, etc.).
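One way to extend the look-ahead to emoticons with noses is to allow an optional `-` before the mouth character. This is a sketch of that step, assuming base R's `gsub()` with `perl = TRUE` in place of `stri_replace_all_regex()`; the sample string is made up.

```r
text <- ":) he said: hi; :-) ;-D"

# Delete ":" or ";" unless an (optionally nosed) emoticon mouth follows.
result <- gsub("[:;](?!-?[)P(DX8\\[\\]])", "", text, perl = TRUE)
print(result)  # ":) he said hi :-) ;-D"
```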
Now remove the hyphens (negative look-behind and look-ahead).
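The hyphen step can be sketched by combining both assertions: a `-` is deleted only if it neither follows `:` or `;` nor precedes a mouth character. As before, this assumes base R's `gsub()` with `perl = TRUE`, and the input is illustrative.

```r
text <- "a well-known fact :-) ;-D"

# Delete "-" unless it follows ":" or ";" (a nose) or precedes a mouth character.
result <- gsub("(?<![:;])-(?![)P(DX8\\[\\]])", "", text, perl = TRUE)
print(result)  # "a wellknown fact :-) ;-D"
```

The look-ahead also spares hyphens that merely happen to precede a mouth character (for example the `-` in `x-D`), which is one reason this approach is far from perfect.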
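A helper function that escapes some special characters so that they can be used in a regex:
escape_regex <- function(r) {
   library("stringi")
   # prefix each (, ), [ and ] with a backslash
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}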
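Since some punctuation marks occur inside emoticons, we should not remove them:
# TRUE for every punctuation position that lies inside an emoticon
which_punct_omit <- sapply(seq_len(nrow(where_punct)), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
       where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit, , drop = FALSE] # update where_punct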
print(where_punct)
## start end
## [1,] 27 27
## [2,] 38 38
## [3,] 39 39
## [4,] 40 40
## [5,] 46 46
## [6,] 54 54
## [7,] 58 58
## [8,] 60 60
## [9,] 71 71
## [10,] 72 72
## [11,] 73 73
## [12,] 99 99
## [13,] 107 107
The result is:
stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-) and the salesperson said Oh boy"
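The position-based steps (locate emoticons, locate punctuation, drop the punctuation that lies outside every emoticon) can also be sketched in base R. This is an illustration under assumptions, not the answer's stringi code: `emo_pattern` is a hypothetical, deliberately small emoticon pattern, and the text is assumed to be plain ASCII so that match positions equal character indices.

```r
text <- ":) I've been to... the store :P :-)"
emo_pattern <- ":-?[)(DPp]"  # hypothetical small emoticon pattern

# Start and end position of every emoticon.
m <- gregexpr(emo_pattern, text, perl = TRUE)[[1]]
emo_start <- as.integer(m)
emo_end <- emo_start + attr(m, "match.length") - 1L

# Position of every punctuation character (gregexpr returns -1 if none).
punct_pos <- as.integer(gregexpr("[[:punct:]]", text)[[1]])
punct_pos <- punct_pos[punct_pos > 0]

# Flag the punctuation positions that fall inside some emoticon.
inside <- vapply(punct_pos,
                 function(p) any(p >= emo_start & p <= emo_end),
                 logical(1))
drop <- punct_pos[!inside]

# Remove the remaining positions, guarding against an empty index.
chars <- strsplit(text, "")[[1]]
if (length(drop) > 0) chars <- chars[-drop]
result <- paste(chars, collapse = "")
print(result)  # ":) Ive been to the store :P :-)"
```

The guard before `chars[-drop]` matters because indexing with an empty negative index would return an empty vector rather than the unchanged text.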
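Here is a less sophisticated and slower approach than @gagolews's solution. It requires you to give it a dictionary of emoticons. You can create one or use the one in the qdapDictionaries package. The basic approach converts the emoticons to text that cannot be mistaken for anything else (the code uses an EMOTICONREPLACE prefix and stores the result in dat$Temp), strips the punctuation, and then converts the placeholders back into emoticons.
I added this functionality to qdap version > 2.0.0 as the sub_holder function. Basically, this function uses the approach above but eases the coding burden. The sub_holder function takes a text vector and the items you want to sub out (such as emoticons). It returns a list containing:
1. The output vector with placeholders substituted for the items
2. A function (called unhold) that swaps the placeholders back for the original terms
The code is as follows: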
emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
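For readers without qdap, the placeholder idea can be sketched in base R. The function name `hold_strip_unhold` is hypothetical, and the sketch only removes `[[:punct:]]` characters, so it does not reproduce everything that `strip()` does. It also assumes that the placeholder text never occurs in the input.

```r
hold_strip_unhold <- function(x, emos) {
  holders <- paste0("EMOTICONREPLACE", seq_along(emos))
  # 1. Swap each emoticon for a punctuation-free placeholder.
  for (i in seq_along(emos)) x <- gsub(emos[i], holders[i], x, fixed = TRUE)
  # 2. Strip all punctuation; the placeholders survive untouched.
  x <- gsub("[[:punct:]]", "", x)
  # 3. Swap the placeholders back, highest index first so that
  #    EMOTICONREPLACE1 cannot clobber EMOTICONREPLACE10 and above.
  for (i in rev(seq_along(emos))) x <- gsub(holders[i], emos[i], x, fixed = TRUE)
  x
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
result <- hold_strip_unhold("Kudos :) (really), wait... X-(", emos)
print(result)  # "Kudos :) really wait X-("
```

Using `fixed = TRUE` avoids having to escape the parentheses in the emoticons.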
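Using rex may make this type of task a little simpler. It automatically escapes characters as needed, and if you put a vector in the or() function it will or together all of the vector's elements. Using re_matches() with the global argument gives you a list of all the emoticons found in a given line.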
x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label = c("ãããæããããéãããæãããInappropriate announce:-(",
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something you are working to fix?",
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)",
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D",
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...",
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L,
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54",
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -6L))
emots <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))
library(rex)
re_matches(x$text,
  rex(
    capture(name = 'emoticon',
      or(emots)
    )
  ),
  global = TRUE)
#>[[1]]
#> emoticon
#>1 :D
#>2 :D
#>
#>[[2]]
#> emoticon
#>1 <NA>
#>
#>[[3]]
#> emoticon
#>1 :-(
#>
#>[[4]]
#> emoticon
#>1 <NA>
#>
#>[[5]]
#> emoticon
#>1 :p
#>
#>[[6]]
#> emoticon
#>1 :)
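A comparable per-line extraction can be sketched in base R with `gregexpr()` and `regmatches()`. This is an assumption-laden alternative, not rex's API: the emoticons are quoted with PCRE's `\Q...\E` so that no manual escaping is needed, and the three sample strings are shortened, illustrative versions of the tweets.

```r
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"),
                            paste0))

# \Q...\E makes PCRE treat everything in between as literal text.
pattern <- paste0("\\Q", emots, "\\E", collapse = "|")

x <- c("enjoy a #great :D breakfast onboard! :D",
       "payment gateways is not working.",
       "Inappropriate announce:-(")

# One character vector of matches per input string (empty if none found).
found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
print(found)
```

Unlike the rex output shown above, lines without an emoticon yield an empty character vector rather than NA.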
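Comments:
First tag the emoticons (gsub them with markers such as $SMILEY1), then remove the punctuation, then replace the markers with the emoticons again.
If I use some emoticons such as =(, I get the error message "Error in gsub("=(", "emoticons", Data_edited_txt$text) : invalid regular expression '=(', reason 'Missing ')''".
@user3456230 That is because ( has to be escaped in a regex; see the escape_regex function in my answer.
@Richard, they are very important. They represent a person's attempt to re-inject concrete representations of emotion, gesture and body language into a virtual space. They carry a great deal of information. If anyone here teaches an online course, note that asking you to stop using them would put extreme limits on the conversation, ensuring that no real conversation takes place. It would be like telling students in a classroom that they cannot use facial expressions, eye contact, body movement or gestures.
@RichardScriven Sorry, no disrespect intended. This illustrates how hard it is to read people's tone in written language. You mistook my excitement for offense :-)
This is a nice solution, but what about the :-( in the third string...? I think he wants to keep it.
That is why it is there.
@TylerRinker, thanks for the great solution! In this case, is it possible to keep the exclamation mark (!) and the question mark (?) as well? I need the data set for sentiment analysis.
This is a really nice, creative solution.
The original data set is in the OP.
Why not match (emoticons)|punctuation (note that the emoticons are in a capturing group) and then replace with the result of capture group 1, which only matches emoticons? Because | is greedy and leftmost, the emoticons will be matched early and often.
Great! Could you post an answer using this idea?
I have some R code that uses this regex feature and works as expected.
I don't want credit for this :) I know very little R, so I don't think I would add anything by actually writing the code, and this answer has already contributed a lot by defining emots in a maintainable way.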
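The `=(` error reported in the comments arises because `(` opens a group in a regex. A minimal sketch of two ways around it in base R follows; the sample string and the EMOTICON marker are made up for illustration.

```r
x <- "sad =( day"

# Option 1: treat the pattern as a literal string instead of a regex.
a <- gsub("=(", "EMOTICON", x, fixed = TRUE)

# Option 2: keep regex matching but escape the parenthesis.
b <- gsub("=\\(", "EMOTICON", x)

print(c(a, b))  # both give "sad EMOTICON day"
```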
text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
## start end
## [1,] 1 2
## [2,] 4 5
## [3,] 7 8
## [4,] 10 11
## [5,] 13 14
## [6,] 16 17
## [7,] 23 24
## [8,] 64 65
## [9,] 67 69
where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
## start end
## [1,] 1 1
## [2,] 2 2
## [3,] 4 4
## [4,] 7 7
## [5,] 8 8
## ...
## [26,] 72 72
## [27,] 73 73
## [28,] 99 99
## [29,] 107 107
text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1] 58 41 32 59 80 32 58 93 32 58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!
library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)
reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]),
strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))
truncdf(left_just(dat[, 3, drop=F]), 50)
## Temp
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service
## 3 ãããæããããéãããæãããInappropriate announce:-(
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o
# Alternative to the strip() call above: also keep ! and ? via char.keep
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]),
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))