Remove punctuation but keep emoticons? (String, R, Text, Gsub, Emoticons) - Fatal编程技术网

Is it possible to remove all punctuation but keep the following emoticons?

:-(

:)

:D

:p

1. Working pure regex solution (a.k.a. edit #2)

This task can be done entirely with regular expressions (many thanks to @Mike Samuel).

First, we build a database of emoticons:

library(stringi)
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
               c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"
# escape_regex() is the helper function defined further below
(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
By the way, if you only want to remove a selected set of characters, put e.g. [,] instead of [\\p{P}] above.
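The final step of the pure regex approach can be sketched as follows. The idea, taken from the comments on this question, is to match (emoticons)|punctuation and replace every match with capture group 1, which is empty unless an emoticon was matched. The sample text and the variable name result are assumptions made for illustration only.

```r
library(stringi)

# Emoticon "database": every prefix combined with every suffix.
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"),
                            stri_paste))

# Escape the regex metacharacters that occur in these emoticons.
escape_regex <- function(r) {
  stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse = "|"), ")")

# Hypothetical sample input.
text <- ":) ;P :] LOL :) I've been to... the (grocery) store :P :-) --- Oh boy!"

# Match either an emoticon (capture group 1) or a single punctuation character.
# Replacing with "$1" keeps the emoticons and drops all other matches.
result <- stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
print(result)
```

Because the emoticon alternatives come first in the pattern, they are tried before the generic punctuation class at every position.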

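2. Regex solution hint - my first (unwise) attempt (a.k.a. the original answer)

My first idea (kept here mainly for "historical reasons") was to solve this problem using look-aheads and look-behinds but, as you will see, that is far from perfect.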

To remove every : and ; that is not followed by ), P, (, D, X, 8, [, or ], use a negative look-ahead:

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"
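As a small worked example of the same negative look-ahead, here is a sketch on a made-up input string (the input and the variable name cleaned are assumptions): colons and semicolons used as ordinary punctuation disappear, while those that start an emoticon survive.

```r
library(stringi)

# Hypothetical input: ordinary ':' and ';' mixed with nose-less emoticons.
text <- "Note: this is fine; really :) ;P :("

# Delete ':' or ';' unless the next character is one of ) P ( D X 8 [ ].
cleaned <- stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
print(cleaned)
```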
Now we can add some old-school emoticons (with noses, e.g. :-) or ;-D).

Now remove the hyphens (using a negative look-behind and a negative look-ahead).
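The exact pattern used for this step is not shown here, so the following is only one possible way to combine a negative look-behind with a negative look-ahead, not necessarily the author's original regex. The sample string and the name no_hyphens are assumptions.

```r
library(stringi)

# Hypothetical input with nosed emoticons and ordinary hyphens.
text <- ":-) --- ;-D well-known a-)"

# A hyphen is deleted when it is NOT preceded by ':' or ';' (look-behind),
# or when it is NOT followed by a "mouth" character (look-ahead).
# It therefore survives only inside something like ':-)' or ';-D'.
no_hyphens <- stri_replace_all_regex(text, "(?<![:;])-|-(?![)P(DX8\\[\\]])", "")
print(no_hyphens)
```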

A helper function that escapes some special characters so that they can be used in a regular expression:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
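A short usage sketch of this helper, assuming a few sample emoticons (the inputs and the names escaped, pattern and found are illustrative). Note that the helper only escapes parentheses and square brackets, so emoticons containing other metacharacters such as | or * would need a broader escape.

```r
library(stringi)

escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

# Each bracket or parenthesis gets a backslash in front of it.
escaped <- escape_regex(c(":)", ":-[", ":D", "=("))
print(escaped)

# The escaped strings can be joined into one alternation and used safely.
pattern <- stri_c(escaped, collapse = "|")
found <- stri_extract_all_regex("ok :) then :-[ and :D or =(", pattern)[[1]]
print(found)
```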
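Since some punctuation characters occur within emoticons, they should not be removed (where_emots and where_punct hold the match positions of the emoticons and of all punctuation characters, located with stri_locate_all_regex as in the full listing further below):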

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107
The result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

That's it.
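The location-based steps can be wrapped into a single function that also guards against the "no punctuation" and "nothing to remove" cases, where negative indexing with an empty vector would otherwise drop every character. This is a minimal sketch for a single string; the function name remove_punct_keep and its arguments are hypothetical.

```r
library(stringi)

# text: a single string; keep_regex: regex matching the spans to protect.
remove_punct_keep <- function(text, keep_regex) {
  where_keep  <- stri_locate_all_regex(text, keep_regex)[[1]]
  where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]

  # stri_locate_all_regex returns a single NA row when nothing matches.
  if (is.na(where_punct[1, 1])) return(text)

  # TRUE for punctuation that lies inside a protected span.
  inside <- vapply(seq_len(nrow(where_punct)), function(i) {
    any(where_punct[i, 1] >= where_keep[, 1] &
        where_punct[i, 2] <= where_keep[, 2], na.rm = TRUE)
  }, logical(1))

  drop <- where_punct[!inside, 1]
  if (length(drop) == 0) return(text)  # x[-integer(0)] would be empty

  # Work on code points so that positions match the located indices.
  chars <- stri_enc_toutf32(text)[[1]]
  stri_enc_fromutf32(chars[-drop])
}

out1 <- remove_punct_keep("Hi :) there, friend!", ":\\)|:-\\(")
out2 <- remove_punct_keep("no punct", ":\\)")
print(c(out1, out2))
```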

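This is a less sophisticated and slower approach than @gagolews's solution. It requires you to give it a dictionary of emoticons. You can create one or use the one in the qdapDictionaries package. The basic approach is to convert the emoticons into text that cannot be mistaken for anything else (I use an EMOTICONREPLACE prefix for this; see the dat$Temp code in the listing further below), strip the punctuation, and then convert the placeholders back into emoticons.

I added this functionality to qdap version > 2.0.0 as the sub_holder function. Basically, this function uses the approach described above, but eases the coding burden. The sub_holder function takes a text vector and the items to sub out (such as emoticons). It returns a list containing: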

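  • the text vector with the sub items replaced by placeholders
  • a function (called unhold) that swaps the placeholders back for the original terms

Here is the code: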

    emos <- c(":-(", ":)", ":D", ":p", "X-(")
    (m <- sub_holder(emos, dat[,1]))
    m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
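To make the placeholder idea concrete without qdap, here is a minimal base R sketch. It is not qdap's implementation of sub_holder; the function name keep_emoticons and the zero-padded placeholder format are assumptions chosen for illustration.

```r
# Hold emoticons as placeholders, strip punctuation, then restore them.
keep_emoticons <- function(text, emos) {
  # Longest emoticons first, so a shorter one never splits a longer one.
  emos <- emos[order(nchar(emos), decreasing = TRUE)]
  # Zero-padded ids avoid one placeholder being a prefix of another.
  holders <- sprintf("EMOTICONREPLACE%03d", seq_along(emos))

  for (i in seq_along(emos)) {
    text <- gsub(emos[i], holders[i], text, fixed = TRUE)
  }
  text <- gsub("[[:punct:]]", "", text)  # remove all remaining punctuation
  for (i in seq_along(emos)) {
    text <- gsub(holders[i], emos[i], text, fixed = TRUE)
  }
  text
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
res <- keep_emoticons(c("Kudos, team :)", "Wait... what?! X-("), emos)
print(res)
```

Using fixed = TRUE sidesteps the regex escaping problem entirely for the emoticons themselves.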
    
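Using rex may make this type of task simpler. It automatically escapes characters as needed, and it automatically ORs together all elements of a vector if the vector is put inside an or() function. Using re_matches() with the global argument gives you a list of all the emoticons found in a given line.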

    x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
    "@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
    "@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
    "RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
    "xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
    "You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
    ), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
    1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
    "3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
    "created"), class = "data.frame", row.names = c(NA, -6L))
    
    emots <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))
    
    library(rex)
    re_matches(x$text,
      rex(
        capture(name = 'emoticon',
          or(emots)
        )
      ),
      global = TRUE)
    
    #>[[1]]
    #>  emoticon
    #>1       :D
    #>2       :D
    #>
    #>[[2]]
    #>  emoticon
    #>1     <NA>
    #>
    #>[[3]]
    #>  emoticon
    #>1      :-(
    #>
    #>[[4]]
    #>  emoticon
    #>1     <NA>
    #>
    #>[[5]]
    #>  emoticon
    #>1       :p
    #>
    #>[[6]]
    #>  emoticon
    #>1       :)
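For comparison, the same per-line extraction can be sketched in base R with gregexpr and regmatches. This is an alternative under stated assumptions, not the rex API: the escaping is done by hand, and the sample vector lines and the names escaped, pattern and found are made up for illustration.

```r
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"),
                            paste0))

# Escape parentheses and square brackets by hand (rex does this automatically).
escaped <- gsub("([()\\[\\]])", "\\\\\\1", emots, perl = TRUE)
pattern <- paste(escaped, collapse = "|")

# Hypothetical sample lines.
lines <- c("great :D breakfast :D", "no emoticon here", "announce:-(")

# One list element per line, holding every emoticon found in that line.
found <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))
print(found)
```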
    
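First mark the emoticons (gsub them into tokens such as $SMILEY1), then remove the punctuation, then replace the tokens with the emoticons again.

For some emoticons, e.g. =(, I get the error message: Error in gsub("=(", "emoticons", Data_edited_txt$text): invalid regular expression '=(', reason 'Missing ')''.

@user3456230 That is because ( has to be escaped in a regular expression; see the escape_regex function in my answer.

@Richard, they are very important. They represent a person's attempt to re-inject concrete expressions of emotion, gesture and body language into a virtual space. They carry a lot of information. If anyone here teaches online discussion courses and forbids them, I would like you to stop, because you restrict the conversation so severely that no real conversation takes place. It is like telling students that they may not use facial expressions, eye contact, body movement or gestures in class.

@RichardScriven Sorry, no disrespect intended. This shows how hard it is to read a person's tone in written language. You mistook my excitement for offense :-)

This is a nice solution, but what about the :-( in the third string...? I think he wants to keep it.

That is why it is there.

@TylerRinker, thank you for the great solution! In this case, is it possible to keep exclamation marks (!) and question marks (?) as well? I need the dataset for sentiment analysis.

This is a very nice, creative solution for the original dataset in the OP.

Why not match (emoticons)|punctuation (note that the emoticons are in a capturing group) and then replace with the result of capture group 1, which only matches emoticons? Because | is greedy and left-biased, the emoticons will be matched early and often.

Great! Could you post an answer using this idea? I have some R code that uses this regex feature and works as expected.

I don't want credit for this :) I know very little about R, so I don't think I would add anything by actually writing the code, and this answer has already contributed a lot by defining emots in a maintainable way.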
    text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"
    
    escape_regex <- function(r) {
       library("stringi")
       stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
    }
    
    (regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
    ## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"
    
    where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
    print(where_emots)
    ##       start end
    ##  [1,]     1   2
    ##  [2,]     4   5
    ##  [3,]     7   8
    ##  [4,]    10  11
    ##  [5,]    13  14
    ##  [6,]    16  17
    ##  [7,]    23  24
    ##  [8,]    64  65
    ##  [9,]    67  69
    
    where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
    print(where_punct)
    ##       start end
    ##  [1,]     1   1
    ##  [2,]     2   2
    ##  [3,]     4   4
    ##  [4,]     7   7
    ##  [5,]     8   8
    ## ...
    ## [26,]    72  72
    ## [27,]    73  73
    ## [28,]    99  99
    ## [29,]   107 107
    
    which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
       any(where_punct[i,1] >= where_emots[,1] &
            where_punct[i,2] <= where_emots[,2]) })
    where_punct <- where_punct[!which_punct_omit,] # update where_punct
    print(where_punct)
    ##       start end
    ##  [1,]    27  27
    ##  [2,]    38  38
    ##  [3,]    39  39
    ##  [4,]    40  40
    ##  [5,]    46  46
    ##  [6,]    54  54
    ##  [7,]    58  58
    ##  [8,]    60  60
    ##  [9,]    71  71
    ## [10,]    72  72
    ## [11,]    73  73
    ## [12,]    99  99
    ## [13,]   107 107
    
    text_tmp <- stri_enc_toutf32(text)[[1]]
    print(text_tmp) # here - just ASCII codes...
    ## [1]  58  41  32  59  80  32  58  93  32  58....
    text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!
    
    stri_enc_fromutf32(text_tmp)
    ## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"
    
    library(qdap)
    #reps <- emoticon
    emos <- c(":-(", ":)", ":D", ":p", "X-(")
    reps <- data.frame(seq_along(emos), emos)
    
    reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
    dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
    dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
        strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))
    
    truncdf(left_just(dat[, 3, drop=F]), 50)
    
    ##   Temp                                              
    ## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
    ## 2 You know there is a problem when customer service 
    ## 3 ãããæããããéãããæãããInappropriate announce:-(         
    ## 4 AirAsia your direct debit Maybank payment gateways
    ## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
    ## 6 AirAsia Apart from the slight delay and shortage o
    
    dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
        strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
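Keeping ! and ? can also be combined with the capture-group regex idea from the comments. The sketch below uses stringi rather than qdap and excludes the two characters from the punctuation class with a negative look-ahead. The sample text and the names keep and result are assumptions.

```r
library(stringi)

emos <- c(":-(", ":)", ":D", ":p", "X-(")

# Escape parentheses and square brackets so the emoticons are valid regex.
escape_regex <- function(r) {
  stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}
keep <- stri_c("(", stri_c(escape_regex(emos), collapse = "|"), ")")

# Hypothetical sample input.
text <- "Wait... what?! Kudos, team :) X-("

# (?![!?]) stops '!' and '?' from being matched as removable punctuation;
# emoticons are captured in group 1 and written back via "$1".
result <- stri_replace_all_regex(text, stri_c(keep, "|(?![!?])\\p{P}"), "$1")
print(result)
```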
    