tm Bigrams workaround still producing unigrams
Tags: r, tm, n-gram

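I am trying to use tm's DocumentTermMatrix function to produce a matrix with bigrams instead of unigrams. I have tried using the outlined examples in my function (here are three examples):

Unfortunately, each of these three versions of the function produces exactly the same output: a DTM with unigrams rather than bigrams (image included for simplicity):

For convenience, here is a subset of the data I am working with: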

x = data.frame("CaseName" = c("Attorney General's Reference (No.23 of 2011)", "Attorney General's Reference (No.31 of 2016)", "Joseph Hill & Co Solicitors, Re"),
               "CaseID"= c("[2011]EWCACrim1496", "[2016]EWCACrim1386", "[2013]EWCACrim775"),
               "CaseTranscriptText" = c("sanchez 2011 02187 6 appeal criminal division 8 2011 2011 ewca crim 14962011 wl 844075 wales wednesday 8 2011 attorney general reference 23 2011 36 criminal act 1988 representation qc general qc appeared behalf attorney general", 
                                        "attorney general reference 31 2016 201601021 2 appeal criminal division 20 2016 2016 ewca crim 13862016 wl 05335394 dbe honour qc sitting cacd wednesday 20 th 2016 reference attorney general 36 criminal act 1988 representation",
                                        "matter wasted costs against company solicitors 201205544 5 appeal criminal division 21 2013 2013 ewca crim 7752013 wl 2110641 date 21 05 2013 appeal honour pawlak 20111354 hearing date 13 th 2013 representation toole respondent qc appellants"))

There are a few problems with your code. I am focusing only on the last function you created, because I do not use the tau or RWeka packages.

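1. To use a tokenizer, you need to specify tokenizer=…, not tokenize=…
2. You need a VCorpus, not a Corpus.
3. After adjusting this in your function, I was not happy with the results: not everything specified in the control options is handled correctly. I created a second function, make_dtm_adjusted, so that you can see the differences between the two functions.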
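The effect of the corpus type can be checked on a toy input. The sketch below is an illustration, not part of the original answer: the two sample strings are made up, and it assumes tm 0.7 or later, where Corpus(VectorSource(...)) returns a SimpleCorpus. It spells the control entry tokenize, which is the name listed in the termFreq help page as far as I can tell; tokenizer, as written above, may reach the same entry through R's partial matching of list names, but I have not confirmed that against the tm source.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Made-up sample strings, for illustration only
texts <- c("attorney general reference", "appeal criminal division")

# Bigram tokenizer in the same form as the one used in this thread
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

simple   <- Corpus(VectorSource(texts))   # a SimpleCorpus on recent tm versions (assumption)
volatile <- VCorpus(VectorSource(texts))  # a VCorpus

class(simple)
class(volatile)

# On a SimpleCorpus, tm may warn that the custom tokenizer is ignored
dtm_simple   <- DocumentTermMatrix(simple,   control = list(tokenize = BigramTokenizer))
dtm_volatile <- DocumentTermMatrix(volatile, control = list(tokenize = BigramTokenizer))

Terms(dtm_simple)    # compare these terms ...
Terms(dtm_volatile)  # ... with these
```

If the first set of terms contains single words and the second contains two-word terms, the corpus class is what decides whether the tokenizer is applied.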

library(tm)  # also attaches NLP, which provides ngrams() and words()

# OP's function, adjusted to make it work
make_dtm = function(main_df, stem=F){
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenizer=BigramTokenizer,
                                                           stopwords=T,
                                                           tolower=T,
                                                           removeNumbers=T,
                                                           removePunctuation=T,
                                                           stemming = stem))
  return(decisions.dtm)
}

# improved function
make_dtm_adjusted = function(main_df, stem=F){
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))

  decisions <- tm_map(decisions, content_transformer(tolower))
  decisions <- tm_map(decisions, removeNumbers)
  decisions <- tm_map(decisions, removePunctuation)
  # specifying your own stopword list is better as you can use stopwords("smart")
  # or your own list
  decisions <- tm_map(decisions, removeWords, stopwords("english")) 
  decisions <- tm_map(decisions, stripWhitespace)

  decisions.dtm = DocumentTermMatrix(decisions, control = list(stemming = stem,
                                                               tokenizer=BigramTokenizer))
  return(decisions.dtm)
}
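Why the two functions give different terms can be seen on a smaller input. This is a minimal sketch under my own assumptions: the two sentences are invented, the control entry is spelled tokenize as in the termFreq help page, and my understanding is that stop-word filtering requested through the control list is applied to the tokens after tokenization, so it is matched against whole bigrams.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# Invented sentences, for illustration only
docs <- VCorpus(VectorSource(c("the appeal of the attorney general",
                               "the reference of the attorney general")))

# Stop words requested through the control list: the filter sees bigrams,
# and a bigram such as "of the" is not itself an entry in the stop-word list
dtm_control <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
                                                       stopwords = TRUE))

# Stop words removed from the text first, bigrams built afterwards
cleaned <- tm_map(docs, removeWords, stopwords("english"))
cleaned <- tm_map(cleaned, stripWhitespace)
dtm_cleaned <- DocumentTermMatrix(cleaned, control = list(tokenize = BigramTokenizer))

Terms(dtm_control)
Terms(dtm_cleaned)
```

Cleaning with tm_map before building the matrix keeps the preprocessing order explicit. It also changes which words end up adjacent, so the resulting bigrams are not a subset of the uncleaned ones.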

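Could you elaborate on the difference between VCorpus and Corpus? I have been using Corpus for a long time without any problems.

Sorry for the late reply, but the answer at the bottom covers this well.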