How do I tokenize words in R that are not in a dictionary?


r dictionary


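I am working with a dataset that has to be tokenized for training. I built a dictionary before tokenizing, so I need to retrieve the words that are in the dictionary.

My text is as follows: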

t <- "In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments.  These incisions may be placed in different parts of the abdominal wall.  Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length.  There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities.  Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."

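You need to look at the dictionary option of the term-document matrix control list. It returns only the words that are in the dictionary.
dictionary:
A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL, which means that all terms in the document are listed.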
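As a rough illustration of what "tabulated against" means, here is a minimal base-R sketch. It does not call tm; it only mimics the effect described above. The `tokens` and `dict` vectors are hypothetical examples, not data from the question.

```r
# Hypothetical tokens and dictionary, for illustration only.
tokens <- c("the", "incision", "may", "be", "large", "incision")
dict   <- c("abdomen", "incision")

# Turning the tokens into a factor whose levels are the dictionary terms
# maps every non-dictionary token to NA; table() drops NA by default.
# Dictionary terms that never occur are still listed, with a count of 0.
counts <- table(factor(tokens, levels = dict))
print(counts)
```

Only the dictionary terms appear in the result, and a term that is absent from the text is kept with a zero count, which matches the zero row for "intensive care unit" in the output further down.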
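You can use the following code. Note that removePunctuation would also remove the hyphen in "hand-held", and there is no need for it anyway: the tokenizer strips most punctuation.
Edit: based on the comments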
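# Packages used below: NGramTokenizer() and Weka_control() come from RWeka
library(tm)
library(RWeka)

# ASSUMPTION: the dictionary definition is missing from this page; the three
# terms below are inferred from the rows of the output shown further down.
dict <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

# Preprocessing of data
corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, PlainTextDocument)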

#Tokenizers
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# bigrams from the corpus, with the bigrams found inside dictionary terms removed (passed as stopwords)
tdm_bigram_no_dict <- TermDocumentMatrix(corpus,control=list(stopwords = BigramTokenizer(dict), tokenize = BigramTokenizer))
# dictionary bigrams from corpus
tdm_bigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = BigramTokenizer, dictionary = dict))
inspect(tdm_bigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1

# dictionary trigrams from corpus
tdm_trigram_dict <- TermDocumentMatrix(corpus,control=list(tokenize = TrigramTokenizer, dictionary = dict))
inspect(tdm_trigram_dict)

<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            1
  intensive care unit                       0
  traditional techniques                    0

# combine term document matrices into one. you can use rbind since tdm's are sparse matrices. If you want extra speed, look into the slam package.
tdm_total <- rbind(tdm_bigram_no_dict, tdm_bigram_dict, tdm_trigram_dict)

Thanks for your effort. But I need both the dictionary words and the tokenized words as output.

This works better!!!! Is there a way to leave the dictionary words untokenized? For example, I would like my output to look like: (abdomen surgeons, abdominal organs, access, adequate visibility, allow the use, incision, intensive care unit, hand-held surgical instruments). Here the dictionary words are retrieved as they are while the other words in the document are tokenized.

The dictionary is tokenized once, and the text is tokenized 3 times and matched against the dictionary twice. If you have a huge corpus, this is not the most efficient approach. You run into this kind of problem because you want to keep the full dictionary strings. At least now you get the dictionary matches and the term frequencies. As you can see, "intensive care unit" does not occur in this text, so it is not counted or shown in the final result.

I agree. But "intensive care unit" should also be present for further processing. I am working on a large corpus, and I keep having to extract the full strings.

# Collapse the combined matrix (tdm_total) into one row per term with its total frequency
library(dplyr)
df <- data.frame(terms = rownames(as.matrix(tdm_total)),
                 freq = rowSums(as.matrix(tdm_total)),
                 row.names = NULL, stringsAsFactors = FALSE)
df <- df %>% group_by(terms) %>% summarise(freq = sum(freq))
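
For the follow-up request in the comments (keeping dictionary phrases whole while everything else is tokenized), one possible alternative is to protect each phrase before splitting the text. The base-R sketch below is not the approach from the answer above and does not use tm or RWeka. The function name `tokenize_keep_dict` and the sample `text` are hypothetical, and it assumes that the text contains no underscores and that plain substring matching is acceptable.

```r
# Hypothetical inputs, for illustration only.
text <- "Surgeons use hand-held surgical instruments, and traditional techniques are widely available."
dict <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

tokenize_keep_dict <- function(text, dict) {
  x <- tolower(text)
  dict <- tolower(dict)
  # Handle the longest phrases first so a shorter phrase cannot split a longer one.
  dict <- dict[order(nchar(dict), decreasing = TRUE)]
  for (phrase in dict) {
    # Join the words of the phrase with underscores so it survives the split.
    placeholder <- gsub(" ", "_", phrase, fixed = TRUE)
    x <- gsub(phrase, placeholder, x, fixed = TRUE)
  }
  # Split on whitespace, then strip punctuation at the edges of each token.
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
  tokens <- tokens[nzchar(tokens)]
  # Turn the underscores back into spaces (assumes none were in the text).
  gsub("_", " ", tokens, fixed = TRUE)
}

tokens <- tokenize_keep_dict(text, dict)
print(tokens)
```

Dictionary phrases that do not occur in the text, such as "intensive care unit" here, are simply absent from `tokens`; if they must be present for later processing, `union(tokens, dict)` would add them.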