R: How do I remove parentheses, together with the words inside them, using the tm package?

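Suppose I have a piece of text like this in a document:

"Other segment comprised of our active pharmaceutical ingredient (API) business,which..."

I want to remove "(API)", and this needs to happen before

corpus <- tm_map(corpus, removePunctuation)

I searched for a long time, but all I could find were answers about removing the parentheses themselves. I do not want the word inside them to appear in the corpus either.

I would really appreciate some hints.

If it is just a single word, how about this (untested):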
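For the single-word case, a minimal sketch of a bracket-stripping helper might look like the following. Only the name removeBracketed appears in this thread; the function body, the regular expression, and the tm wiring are assumptions for illustration, not the answerer's code.

```r
# Hypothetical body for a helper named removeBracketed: the regex is an
# assumption, not code taken from the thread.
removeBracketed <- function(x) {
  # Drop optional leading whitespace plus one parenthesized word, e.g. " (API)"
  gsub("\\s*\\(\\w+\\)", "", x)
}

txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
cleaned <- removeBracketed(txt)
print(cleaned)

# With tm installed, the helper can run before removePunctuation, so the
# parentheses are still present when the bracketed word is matched.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(txt))
  corpus <- tm::tm_map(corpus, tm::content_transformer(removeBracketed))
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
}
```

The order matters: once removePunctuation has run, "(API)" has become "API" and can no longer be told apart from ordinary words.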

removeBracketed

You can use a smarter tokenizer, such as the one in the quanteda package, where removePunct = TRUE
automatically removes the parentheses:
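# the sample sentence from the question
txt <- "Other segment comprised of our active pharmaceutical ingredient (API) business,which..."
quanteda::tokenize(txt, removePunct = TRUE)
## tokenizedText object from 1 document.
## Component 1 :
##  [1] "Other"          "segment"        "comprised"      "of"             "our"            "active"         "pharmaceutical"
##  [8] "ingredient"     "API"            "business"       "which"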

Added:
If you want to tokenize the text first, then you need to lapply a gsub, until we add a regular-expression valuetype to removeFeatures.tokenizedTexts() in quanteda. But this works:
# tokenized version
require(quanteda)
toks <- tokenize(txt, what = "fasterword", simplify = TRUE)
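# grepl() is safe when nothing matches; toks[-grep(...)] would return an empty vector in that case
toks[!grepl("^\\(.*\\)$", toks)]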
## [1] "Other"             "segment"           "comprised"         "of"                "our"               "active"           
## [7] "pharmaceutical"    "ingredient"        "business,which..."
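To make the "lapply a gsub" remark concrete, here is a rough sketch that uses base R only, so it does not depend on any particular quanteda version. The whitespace split stands in for a tokenizer, and the names docs, toks_list and strip_bracketed are hypothetical.

```r
# Two sample documents; the second one is made up for illustration.
docs <- c(
  "Other segment comprised of our active pharmaceutical ingredient (API) business,which...",
  "our (API) business"
)

# Stand-in tokenizer: split each document on runs of whitespace.
# strsplit() returns a list with one character vector per document.
toks_list <- strsplit(docs, "\\s+")

# Remove any parenthesized piece from each token, then drop tokens
# that became empty as a result.
strip_bracketed <- function(toks) {
  toks <- gsub("\\([^)]*\\)", "", toks)
  toks[nzchar(toks)]
}

# lapply() applies the gsub-based cleaner to every document's tokens.
cleaned <- lapply(toks_list, strip_bracketed)
print(cleaned)
```

Because gsub only deletes the parenthesized part, a token such as "(API)," would be reduced to "," rather than dropped, so punctuation removal still has to follow.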

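Thanks for your answer, but I need to remove more than just the parentheses. The word inside them has to go as well.

OK, I have revised my answer, see above.

@JohnChou Thanks for letting me know.

"If you want to tokenize the text first, then you need to lapply a gsub, until we add a regular-expression valuetype to removeFeatures.tokenizedTexts() in quanteda." I don't quite understand this. Could you explain it in more detail? Thanks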

# exactly as in the question
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt)
## [1] "Other segment comprised of our active pharmaceutical ingredient business,which..."

# with added punctuation
txt2 <- "ingredient (API), business,which..."
txt3 <- "ingredient (API).  New sentence..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt2)
## [1] "ingredient, business,which..."
gsub("\\s(\\(\\w*\\))(\\s|[[:punct:]])", "\\2", txt3)
## [1] "ingredient.  New sentence..."
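The pattern above needs a whitespace or punctuation character after the closing parenthesis, and \\w* only matches a single word. As a hedged variant (not part of the original answer), the sketch below also handles a bracket at the very end of the string and multi-word brackets; the name strip_parenthesized is hypothetical.

```r
# Hypothetical helper: remove any "(...)" group together with the
# whitespace that precedes it. Nested parentheses are not handled.
strip_parenthesized <- function(x) {
  gsub("\\s*\\([^)]*\\)", "", x)
}

examples <- c(
  "ingredient (API)",                    # bracket at end of string
  "our (active API) business",           # more than one word inside
  "ingredient (API).  New sentence..."   # bracket followed by punctuation
)
print(strip_parenthesized(examples))
```

Since the match consumes only the whitespace before the bracket, whatever follows the closing parenthesis is left untouched.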