R: How to keep sentence start and end markers with quanteda
I am trying to create 3-grams with R's quanteda package.

I am struggling to find a way to keep the sentence-start and sentence-end markers, <s> and </s>, in the n-grams, as in the code below.

I thought that using keptFeatures with a regular expression matching them would preserve them, but the chevron markers are always removed.

How can I keep the chevron markers from being removed, or what is the best way to delimit the beginning and end of sentences with quanteda?

As a bonus question, how does docfreq(mydfm) compare with colSums(mydfm)? The results of str(colSums(mydfm)) and str(docfreq(mydfm)) are almost identical (Named num [1:n] for the former, Named int [1:n] for the latter).

library(quanteda)
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

How about an approach like this:
ngrams(
tokenize(
unlist(
segment(text, what = "other", delimiter = "(?<=\\</s\\>)", perl = TRUE)),
what = "fastestword", simplify = TRUE),
n = 3L)
# [1] "<s>I'm_a_sentence" "a_sentence_and"
# [3] "sentence_and_I'd" "and_I'd_better"
# [5] "I'd_better_be" "better_be_formatted"
# [7] "be_formatted_properly!</s>" "formatted_properly!</s>_<s>I'm"
# [9] "properly!</s>_<s>I'm_a" "<s>I'm_a_second"
#[11] "a_second_sentence</s>"
To return a simple vector, just unlist the "tokenizedText" object returned from tokenize() (which is a specially classed list with additional attributes). Here I have used what = "fasterword", which splits on "\\s"; it is slightly smarter than what = "fastestword", which splits on " ".
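To preserve the chevron markers in your dfm, you can pass the same options that were used in the tokenize() call above, since dfm() calls tokenize() but with different defaults: it uses those that most users will probably want, whereas tokenize() is much more conservative.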
# Bonus questions:
myDfm <- dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE)
# "chevron" markers are not removed
features(myDfm)
## [1] "<s>i'm" "a" "sentence" "and" "i'd"
## [6] "better" "be" "formatted" "properly!</s><s>i'm" "second"
## [11] "sentence</s>"
Update: some commands and behaviors changed in quanteda v0.9.9.
To return a simple vector that keeps the chevrons:
as.character(toks <- tokens(text, ngrams = 3, what = "fasterword"))
# [1] "<s>I'm_a_sentence" "a_sentence_and" "sentence_and_I'd"
# [4] "and_I'd_better" "I'd_better_be" "better_be_formatted"
# [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a" "properly!</s><s>I'm_a_second"
# [10] "a_second_sentence</s>"
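This does not solve it through the keptFeatures argument; the syntax is different, but it achieves the goal, thanks! It is odd that dfm does not seem to respect keptFeatures, and I am still curious whether there is a way to make that work. Either way, your answer shows more than one way to solve my problem, so I will mark it as the accepted answer. Great package!!

Why do some n-grams contain more than three words, such as "formatted_properly!</s><s>I'm_a"? It looks like some sentence-end tags are joined to the following sentence-start tags, as in "properly!</s><s>I'm_a_second". Thanks!

Why: because "fasterword" splits on whitespace, and there is no whitespace between </s> and <s>. The simplest fix is to pre-process the text, replacing </s><s> with </s> <s>, or to use the segment() approach that @Jota suggested.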
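The pre-processing fix described in the last comment can be sketched in base R. This is only an illustration under stated assumptions: it uses gsub() and strsplit() rather than any quanteda call, and the whitespace split merely stands in for what "fasterword" is described as doing.

```r
# Base-R sketch (no quanteda calls): separate adjacent sentence markers,
# then split on whitespace.
text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

# Insert a space between a closing marker and the opening marker that follows it.
spaced <- gsub("</s><s>", "</s> <s>", text, fixed = TRUE)

# Split on runs of whitespace; the two markers now land in separate tokens.
toks <- strsplit(spaced, "\\s+")[[1]]
toks
```

With the extra space in place, "properly!</s>" and "<s>I'm" become separate tokens, so any n-gram built from them has exactly n whitespace-delimited parts.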
# how to not remove the <s>, and return a vector
unlist(toks <- tokenize(text, ngrams = 3, what = "fasterword"))
## [1] "<s>I'm_a_sentence" "a_sentence_and"
## [3] "sentence_and_I'd" "and_I'd_better"
## [5] "I'd_better_be" "better_be_formatted"
## [7] "be_formatted_properly!</s><s>I'm" "formatted_properly!</s><s>I'm_a"
## [9] "properly!</s><s>I'm_a_second" "a_second_sentence</s>"
# keep it within sentence
(sents <- unlist(tokenize(text, what = "sentence")))
## [1] "<s>I'm a sentence and I'd better be formatted properly!"
## [2] "</s><s>I'm a second sentence</s>"
tokenize(sents, ngrams = 3, what = "fasterword")
## tokenizedText object from 2 documents.
## Component 1 :
## [1] "<s>I'm_a_sentence" "a_sentence_and" "sentence_and_I'd" "and_I'd_better"
## [5] "I'd_better_be" "better_be_formatted" "be_formatted_properly!"
##
## Component 2 :
## [1] "</s><s>I'm_a_second" "a_second_sentence</s>"
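In the output above, the sentence tokenizer breaks after "!", so the closing </s> ends up attached to the start of the second sentence. One possible alternative, sketched here in base R only, is to split after each </s> and build the trigrams per sentence. The helper make_ngrams() is hypothetical and written just for this illustration; it is not a quanteda function.

```r
# Hypothetical helper (not part of quanteda): join every run of n
# consecutive tokens with sep.
make_ngrams <- function(tokens, n = 3L, sep = "_") {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1L)], collapse = sep),
         character(1))
}

text <- "<s>I'm a sentence and I'd better be formatted properly!</s><s>I'm a second sentence</s>"

# Put a newline after every closing marker, then split on it, so each
# sentence keeps its own <s> ... </s> pair.
sentences <- strsplit(gsub("</s>", "</s>\n", text, fixed = TRUE),
                      "\n", fixed = TRUE)[[1]]

# Tokenize each sentence on single spaces and build trigrams per sentence,
# so that no trigram crosses a sentence boundary.
trigrams <- lapply(strsplit(sentences, " ", fixed = TRUE), make_ngrams)
trigrams
```

Because the n-grams are built one sentence at a time, the last trigram of the first sentence ends in "properly!</s>" and the first trigram of the second sentence starts with "<s>I'm".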
# Difference between docfreq() and colSums():
myDfm2 <- dfm(inaugTexts[1:4], verbose = FALSE)
myDfm2[, "representatives"]
## Document-feature matrix of: 4 documents, 1 feature.
## 4 x 1 sparse Matrix of class "dfmSparse"
##                  features
## docs              representatives
##   1789-Washington               2
##   1793-Washington               0
##   1797-Adams                    2
##   1801-Jefferson                0
# docfreq() counts the documents in which the feature occurs
docfreq(myDfm2)["representatives"]
## representatives
##               2
# colSums() totals the feature's count across all documents
colSums(myDfm2)["representatives"]
## representatives
##               4
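The same distinction can be reproduced without quanteda. The sketch below rebuilds the four counts from the table above as a plain matrix (the object names m, doc_freq and total_freq are made up for this illustration) and computes both quantities with base R.

```r
# Plain-matrix stand-in for the one-feature dfm shown above.
m <- matrix(c(2, 0, 2, 0), ncol = 1,
            dimnames = list(docs = c("1789-Washington", "1793-Washington",
                                     "1797-Adams", "1801-Jefferson"),
                            features = "representatives"))

# Document frequency: in how many documents does the feature appear at all?
doc_freq <- colSums(m > 0)

# Total frequency: how many times does the feature appear across all documents?
total_freq <- colSums(m)

doc_freq["representatives"]    # 2 documents contain the feature
total_freq["representatives"]  # 4 occurrences in total
```

The two results agree only when a feature occurs at most once per document, which may be why they look nearly identical on small inputs.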
(sents <- as.character(tokens(text, what = "sentence")))
# [1] "<s>I'm a sentence and I'd better be formatted properly!" "</s><s>I'm a second sentence</s>"
tokens(sents, ngrams = 3, what = "fasterword")
# tokens from 2 documents.
# Component 1 :
# [1] "<s>I'm_a_sentence" "a_sentence_and" "sentence_and_I'd" "and_I'd_better" "I'd_better_be"
# [6] "better_be_formatted" "be_formatted_properly!"
#
# Component 2 :
# [1] "</s><s>I'm_a_second" "a_second_sentence</s>"
featnames(dfm(text, verbose = FALSE, what = "fasterword", removePunct = FALSE))
# [1] "<s>i'm" "a" "sentence" "and" "i'd"
# [6] "better" "be" "formatted" "properly!</s><s>i'm" "second"
# [11] "sentence</s>"