Finding key phrases in R with the tm package


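I have a project that requires me to search the annual reports of different companies and find key phrases in them. I have converted the reports to text files, then created and cleaned a corpus. After that I created a document-term matrix. The tm_term_score function seems to work only for single words, not for phrases. Is it possible to search the corpus for key phrases (not necessarily the most frequent ones)?

For example:

I want to see how many times the phrase "supply chain finance" occurs in each document in the corpus. However, when I run the code with tm_term_score, it reports that no document contains the phrase, even though the documents do contain it.

My progress so far: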

library(tm)
library(stringr)

setwd("C:/Users/Desktop/Annual Reports")

dest <- "C:/Users/Desktop/Annual Reports"

a <- Corpus(DirSource("C:/Users/Desktop/Annual Reports"), readerControl = list(language = "lat"))

a <- tm_map(a, removeNumbers)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, removePunctuation)
a <- tm_map(a, stripWhitespace)

tokenizing.phrases <- c("supply growth", "import revenues", "financing projects")
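For context, here is a rough sketch of why a phrase lookup against a word-level matrix comes back empty. It uses plain `strsplit` on whitespace as a stand-in for word tokenization (an assumption for illustration, not tm's actual tokenizer), and the sample sentence is made up:

```r
# Made-up sentence standing in for one line of an annual report
text <- "we expanded our supply chain finance programme"
phrase <- "supply chain finance"

# Stand-in for word-level tokenization: split on whitespace
terms <- strsplit(text, "[[:space:]]+")[[1]]

# The phrase as a whole is not one of the terms ...
phrase_is_term <- phrase %in% terms

# ... although each of its words is a term
words_are_terms <- all(strsplit(phrase, " ")[[1]] %in% terms)

# A literal search on the untokenized text does find the phrase
phrase_in_text <- grepl(phrase, text, fixed = TRUE)

print(c(phrase_is_term, words_are_terms, phrase_in_text))
```

Once the text has been split into single words, the phrase no longer exists as one term, so any lookup by term name will miss it.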

Perhaps something like the following will help you.

First, create an object containing your key phrases, for example:
tokenizing.phrases <- c("general counsel", "chief legal officer", "inside counsel", "in-house counsel",
                        "law department", "law dept", "legal department", "legal function",
                        "law firm", "law firms", "external counsel", "outside counsel",
                        "law suit", "law suits", # can be hyphenated, eg.
                        "accounts payable", "matter management")

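Thank you for your reply. I am still working through your answer. I have edited my question to include the parts of your suggestion that I understood. Sorry! I am very new to R and I appreciate your help.

Hi, lawyer! When I enter the code you provided, I get the following error and warning messages. str_detect(x, ignore_case=TRUE(tokenising.phrases)): unused argument (ignore_case=TRUE(tokenising.phrases)). In addition: Warning message: In if (is.na(a)) return(""): the condition has length > 1 and only the first element will be used. How do I fix this? I appreciate your help!
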
phraseTokenizer <- function(x) {
  require(stringr)

  # extract the plain text from the tm TextDocument object and collapse it to a
  # single string, so the checks below never see a vector of length > 1
  x <- as.character(x)
  x <- str_trim(paste(x[!is.na(x)], collapse = " "))
  if (x == "") return(character(0))
  #warning(paste("doing:", x))
  # ignore.case() has been deprecated since stringr 1.0.0;
  # regex(..., ignore_case = TRUE) is its replacement
  phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once on the first hit, so not to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
    # this is recursive: the text before and after the phrase is tokenized the same way
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    # no key phrase left in this piece: fall back to tm's word tokenizer
    out <- MC_tokenizer(x)
  }

  # get rid of any extraneous empty strings, which can happen if a phrase occurs just before a punctuation
  out[out != ""]
}

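# `corpus` is the cleaned corpus (named `a` in the question).
# Note: in recent tm versions a custom tokenizer may be ignored for a SimpleCorpus,
# so building the corpus with VCorpus() may be needed.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))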
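
If all you need is how many times each phrase occurs in each document, it may be simpler to skip the tokenizer and count matches in the text directly. Below is a minimal base-R sketch of that different approach. It assumes the document texts are available as a named character vector; `count_phrase`, `docs`, and the sample sentences are hypothetical names and data made up for illustration.

```r
# Hypothetical helper: count case-insensitive, literal (non-regex) occurrences
# of one phrase in each element of a character vector.
count_phrase <- function(texts, phrase) {
  hits <- gregexpr(tolower(phrase), tolower(texts), fixed = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  counts <- vapply(hits, function(m) sum(m > 0), integer(1))
  setNames(counts, names(texts))
}

# Made-up stand-ins for the text of two annual reports
docs <- c(
  report_a = "Supply chain finance grew. We expanded supply chain finance.",
  report_b = "Income from import revenues and financing projects rose."
)

phrases <- c("supply chain finance", "financing projects")

# One row per document, one column per phrase
phrase_counts <- sapply(phrases, count_phrase, texts = docs)
print(phrase_counts)
```

With a tm corpus, the texts could be pulled out with something like `sapply(corpus, function(d) paste(as.character(d), collapse = " "))`. Keep in mind that cleaning steps such as removeWords or removePunctuation change the text, so a phrase that contains stopwords or hyphens may no longer match after cleaning.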