How to keep intra-word periods in unigrams? R quanteda

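I would like to preserve two-letter acronyms that are separated by periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build my unigram frequency table with quanteda, the terminating period gets truncated. Here is a small test corpus to illustrate. I have removed periods as sentence separators: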

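SOS This is the u.s. where our politics is crazy EOS

SOS In the US we watch a lot of t.v. aka TV EOS

SOS TV is an important part of life in the US EOS

SOS folks outside the u.s. probably don't watch so much t.v. EOS

SOS living in other countries is probably not any less crazy EOS

SOS i enjoy my sanity when it comes to visit EOS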

I load this into R as a character vector:

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")

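etc.

I would like to keep the terminal periods on t.v. and u.s., and eliminate the table entry for . with a frequency of 3.

I also don't understand why the period (.) has a count of 3 in this table, while the u.s and t.v unigrams are counted correctly (2 each).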

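The reason for this behaviour is that quanteda's default word tokenizer uses the ICU-based definition of word boundaries (from the stringi package). u.s. appears as the word u.s followed by a period (.) token. This is great for names that contain internal periods, but probably not so great for your purposes. You can easily switch to the whitespace tokenizer, though, using the argument what = "fasterword" passed to tokens(), an option that is available in dfm() through the ... part of the function call.

tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS"      "This"     "is"       "the"      "u.s."     "where"    "our"      "politics" "is"       "crazy"    "EOS" 

You can see here that u.s. is preserved. In response to your last question, the terminal . had a document frequency of 3 because it appeared as a separate token in three documents, which is the default word tokenizer behaviour when remove_punct = FALSE.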
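The two behaviours described above can be imitated in base R, without quanteda, as a rough sketch. The helper split_trailing_period() is hypothetical and only approximates the boundary-based tokenizer by peeling a final period off each token; it is not the ICU algorithm. The whitespace split plays the role of what = "fasterword".

```r
# Test corpus from the question
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")

# Whitespace tokenization: split on runs of whitespace only,
# so "u.s." and "t.v." stay intact
ws_tokens <- strsplit(acro.test, "\\s+")

# Hypothetical helper (an assumption for illustration, not ICU itself):
# split a trailing period off a token and emit it as its own "." token
split_trailing_period <- function(tokens) {
  unlist(lapply(tokens, function(tok) {
    if (grepl("\\.$", tok)) c(sub("\\.$", "", tok), ".") else tok
  }))
}
boundary_tokens <- lapply(ws_tokens, split_trailing_period)

# Document frequency of ".": number of documents with "." as a separate token
period_docfreq <- sum(vapply(boundary_tokens,
                             function(toks) "." %in% toks,
                             logical(1)))

print(ws_tokens[[1]])        # contains "u.s." as one token
print(boundary_tokens[[1]])  # contains "u.s" followed by "."
print(period_docfreq)        # 3: documents 1, 2 and 4
```

Documents 1, 2 and 4 each contain at least one acronym with a terminal period, which is why the separate . token ends up with a document frequency of 3 even though it occurs four times in total.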
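To pass the what = "fasterword" option through to dfm() and then build a data.frame of the document frequencies of the words, the code at the end of this page works (I have tidied it up a bit for efficiency). Note the comment there about the difference between document frequency and term frequency; I have noticed that some users are a little confused about docfreq().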
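Perfect. This is exactly what I was looking for. Nice edit to the title. Thank you for such a thorough response, and for all the work you have put into this package.

The frequency table from the question, in which the terminal periods are split off: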
       ngram frequency
1        SOS         6
2        EOS         6
3        the         4
4         is         3
5          .         3
6        u.s         2
7      crazy         2
8         US         2
9      watch         2
10        of         2
11       t.v         2
12        TV         2
13        in         2
14  probably         2
15      This         1
16     where         1
17       our         1
18  politics         1
19        In         1
20        we         1
21         a         1
22       lot         1
23       aka         1

# I removed the options that were the same as the default 
# note also that stopwords = TRUE is not a valid argument - see remove parameter
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")

# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
#       not the same as docfreq
# dat.dfm <- sort(dat.dfm)

# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
                        row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
##    ngram frequency
## 1    SOS         6
## 2    EOS         6
## 3    the         4
## 4     is         3
## 5   u.s.         2
## 6  crazy         2
## 7     US         2
## 8  watch         2
## 9     of         2
## 10  t.v.         2
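
The distinction between document frequency and term frequency mentioned in the code comments can be checked by hand. The following is a minimal base R sketch that uses a plain whitespace split in place of quanteda; the variable names tokens_list, term_freq and doc_freq are illustrative only.

```r
# Test corpus from the question
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")

# One character vector of whitespace-separated tokens per document
tokens_list <- strsplit(acro.test, "\\s+")

# Term frequency: every occurrence across the whole corpus is counted
term_freq <- table(unlist(tokens_list))

# Document frequency: each token is counted at most once per document
doc_freq <- table(unlist(lapply(tokens_list, unique)))

# "is" occurs twice in the first document, so the two measures differ
print(as.integer(term_freq["is"]))  # 4
print(as.integer(doc_freq["is"]))   # 3

# Both acronyms keep their terminal period and appear in two documents each
print(as.integer(doc_freq["u.s."])) # 2
print(as.integer(doc_freq["t.v."])) # 2
```

The value 3 for "is" matches the frequency column in the output above, which reports document frequency rather than total counts.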