R: How to keep beginning- and end-of-sentence markers with quanteda

r nlp text-mining tm quanteda

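I'm trying to create 3-grams using R's quanteda package.

I'm struggling to find a way to keep the beginning- and end-of-sentence markers, <s> and </s>, in the n-grams, as in the code below.

I thought that using keptFeatures with a regular expression that matches them would preserve them, but the chevron markers are always removed.

How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of sentences with quanteda?

As a bonus question: how does docfreq(mydfm) compare with colSums(mydfm)? The results of str(colSums(mydfm)) and str(docfreq(mydfm)) are almost identical (Named num [1:n] for the former, Named int [1:n] for the latter).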

library(quanteda)
text <- ...

How about an approach like this:
ngrams(
  tokenize(
    unlist(
      segment(text, what = "other", delimiter = "(?<=\\</s\\>)", perl = TRUE)),
    what = "fastestword", simplify = TRUE),
  n = 3L)

# [1] "<s>I'm_a_sentence"              "a_sentence_and"                
# [3] "sentence_and_I'd"               "and_I'd_better"                
# [5] "I'd_better_be"                  "better_be_formatted"           
# [7] "be_formatted_properly!</s>"     "formatted_properly!</s>_<s>I'm"
# [9] "properly!</s>_<s>I'm_a"         "<s>I'm_a_second"               
#[11] "a_second_sentence</s>"

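To return a simple vector, just unlist the "tokenizedText" object returned from tokenize() (a specially classed list with additional attributes). Here I used what = "fasterword", which splits on "\\s"; it is slightly smarter than what = "fastestword", which splits only on the space character.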

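To preserve the chevron markers in a dfm, you can pass the same options used in the tokenize() call above, because dfm() calls tokenize() but with different defaults: it uses the ones most users will probably want, whereas tokenize() is much more conservative.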

# Bonus questions:
myDfm <- dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE)
# "chevron" markers are not removed
features(myDfm)
## [1] "<s>i'm"              "a"                   "sentence"            "and"                 "i'd"                
## [6] "better"              "be"                  "formatted"           "properly!</s><s>i'm" "second"             
## [11] "sentence</s>"

Update: some commands and behaviours have changed in quanteda v0.9.9:

To return a simple vector, keeping the chevrons:

as.character(toks <- tokens(text, ngrams = 3, what = "fasterword"))
#  [1] "<s>I'm_a_sentence"                "a_sentence_and"                   "sentence_and_I'd"                
#  [4] "and_I'd_better"                   "I'd_better_be"                    "better_be_formatted"             
#  [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a"  "properly!</s><s>I'm_a_second"    
# [10] "a_second_sentence</s>"

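Not that it solves the problem, but the argument is keptFeatures.

Different syntax, but it achieves the goal, thanks! Strangely, dfm doesn't seem to respect keptFeatures, and I'm still curious whether there is a way to make it work. Either way, your answer shows there is more than one way to solve my problem, so I'll mark it as the accepted answer. Great package!!

Why do some n-grams contain more than three words, such as the ones that span "formatted properly!" and "I'm"? It looks like some end-of-sentence tags and start-of-sentence tags end up joined together, as in the n-gram containing "I'm a second".

Thanks! As for why: "fasterword" splits on whitespace, and there is no whitespace between the closing and opening markers. The easiest fix is to preprocess the text with a substitution that puts a space between </s> and <s>, or to use the segment() approach suggested by @Jota.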
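The preprocessing fix suggested in the last comment can be sketched in base R, without relying on any particular quanteda version. This is only an illustration: the input string is an assumption reconstructed from the outputs shown in this thread, and make_ngrams() is a hypothetical helper, not a quanteda function.

```r
# Assumed input, reconstructed from the outputs shown in this thread.
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

# Put a space wherever a closing marker is immediately followed by an opening one.
spaced <- gsub("</s><s>", "</s> <s>", text, fixed = TRUE)

# Split on runs of whitespace, which is roughly what what = "fasterword" does.
words <- strsplit(spaced, "\\s+")[[1]]

# Hypothetical helper: join every run of n consecutive tokens with "_".
make_ngrams <- function(x, n = 3L) {
  if (length(x) < n) return(character(0))
  vapply(seq_len(length(x) - n + 1L),
         function(i) paste(x[i:(i + n - 1L)], collapse = "_"),
         character(1))
}

trigrams <- make_ngrams(words, 3L)
print(trigrams)
```

With the markers separated, every n-gram has exactly three tokens, although some n-grams still cross the sentence boundary.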
# how to not remove the <s>, and return a vector 
unlist(toks <- tokenize(text, ngrams = 3, what = "fasterword"))
## [1] "<s>I'm_a_sentence"                "a_sentence_and"                  
## [3] "sentence_and_I'd"                 "and_I'd_better"                  
## [5] "I'd_better_be"                    "better_be_formatted"             
## [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a" 
## [9] "properly!</s><s>I'm_a_second"     "a_second_sentence</s>" 

# keep it within sentence
(sents <- unlist(tokenize(text, what = "sentence")))
## [1] "<s>I'm a sentence and I'd better be formatted properly!"
## [2] "</s><s>I'm a second sentence</s>" 
tokenize(sents, ngrams = 3, what = "fasterword")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "<s>I'm_a_sentence"      "a_sentence_and"         "sentence_and_I'd"       "and_I'd_better"        
## [5] "I'd_better_be"          "better_be_formatted"    "be_formatted_properly!"
## 
## Component 2 :
## [1] "</s><s>I'm_a_second"   "a_second_sentence</s>"
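In the sentence-level output above, the closing </s> of the first sentence ends up attached to the start of the second one. One possible workaround, sketched here in base R rather than with quanteda, is to extract each <s>...</s> span first and build the n-grams per span. The input string is an assumption reconstructed from the outputs in this thread, and make_ngrams() is a hypothetical helper.

```r
# Assumed input, reconstructed from the outputs shown in this thread.
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

# Extract every <s>...</s> span; the lazy quantifier stops at the first closing marker.
sentences <- regmatches(text, gregexpr("<s>.*?</s>", text, perl = TRUE))[[1]]

# Hypothetical helper: join every run of n consecutive tokens with "_".
make_ngrams <- function(x, n = 3L) {
  if (length(x) < n) return(character(0))
  vapply(seq_len(length(x) - n + 1L),
         function(i) paste(x[i:(i + n - 1L)], collapse = "_"),
         character(1))
}

# Tokenize each sentence on whitespace and build trigrams within it only.
sentence_trigrams <- lapply(strsplit(sentences, "\\s+"), make_ngrams, n = 3L)
print(sentence_trigrams)
```

Because each sentence is processed separately, no n-gram spans two sentences, and both markers stay with the sentence they belong to.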

# Difference between docfreq() and colSums():
myDfm2 <- dfm(inaugTexts[1:4], verbose = FALSE)
myDfm2[, "representatives"]
## Document-feature matrix of: 4 documents, 1 feature.
## 4 x 1 sparse Matrix of class "dfmSparse"
##                  features
## docs              representatives
##   1789-Washington               2
##   1793-Washington               0
##   1797-Adams                    2
##   1801-Jefferson                0
docfreq(myDfm2)["representatives"]
## representatives 
##               2 
colSums(myDfm2)["representatives"]
## representatives 
##               4 
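The difference shown above can be reproduced with a plain base R matrix: the column sum adds up all occurrences of a feature, while the document frequency counts the documents in which the feature occurs at least once. The sketch below uses the counts from the output above and is only an illustration of the idea, not of how quanteda implements docfreq().

```r
# Counts of "representatives" taken from the output above.
counts <- matrix(
  c(2, 0, 2, 0),
  ncol = 1,
  dimnames = list(
    docs = c("1789-Washington", "1793-Washington", "1797-Adams", "1801-Jefferson"),
    features = "representatives"
  )
)

# Total occurrences across all documents (what colSums() reports): 4
total_freq <- colSums(counts)

# Number of documents with at least one occurrence (the document frequency): 2
doc_freq <- colSums(counts > 0)

print(total_freq)
print(doc_freq)
```

The two only coincide when no feature occurs more than once in any document, which would explain why the results can look almost identical on a small corpus.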

(sents <- as.character(tokens(text, what = "sentence")))
# [1] "<s>I'm a sentence and I'd better be formatted properly!" "</s><s>I'm a second sentence</s>"                       
tokens(sents, ngrams = 3, what = "fasterword")
# tokens from 2 documents.
# Component 1 :
# [1] "<s>I'm_a_sentence"      "a_sentence_and"         "sentence_and_I'd"       "and_I'd_better"         "I'd_better_be"         
# [6] "better_be_formatted"    "be_formatted_properly!"
# 
# Component 2 :
# [1] "</s><s>I'm_a_second"   "a_second_sentence</s>"

featnames(dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE))
#  [1] "<s>i'm"              "a"                   "sentence"            "and"                 "i'd"                
#  [6] "better"              "be"                  "formatted"           "properly!</s><s>i'm" "second"             
# [11] "sentence</s>"