R: Using tm to mine PDFs for two- and three-word phrases


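I am trying to mine a set of PDFs for specific two- and three-word phrases. I know this question has been asked in various contexts, and an existing solution partly works. However, the resulting list does not return strings that contain more than one word.

I have tried the solutions offered in other threads (and many more besides). Unfortunately, nothing works.

Also, the qdap library won't load, and I wasted an hour trying to fix that, so that solution won't work either, even though it looks fairly simple.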

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")

dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)

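Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function and plug it into the DocumentTermMatrix function. RWeka works very well with tm for this.

If you don't want to install RWeka because of its Java dependency, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I recommend the quanteda package (an example follows the tm code). quanteda runs in parallel, and with quanteda_options you can specify the number of cores to use (2 cores by default).

Note: the unigrams and bigrams in your dictionary overlap. In the example used here, text 127 contains "prices" (3) and "contract prices" (1), so "prices" is double counted.
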
library(tm)
library(RWeka)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")


# adjust to min = 2 and max = 3 for 2 and 3 word ngrams
RWeka_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 2)) 
}

dtm <- DocumentTermMatrix(crude, control=list(tokenize = RWeka_tokenizer,
                                              dictionary = my_words))

# create data.frame from documenttermmatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)
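
If Java is the obstacle, the same tokenizer idea can probably be reproduced with the NLP package, which tm attaches automatically. The following is only a sketch under that assumption: the helper name ngram_tokenizer is made up for illustration, and because words() splits on whitespace, punctuation stays attached to tokens unless it is removed beforehand, so counts may differ slightly from the RWeka version.

```r
library(tm)  # also attaches NLP, which provides words() and ngrams()

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices",
              "diamond", "shamrock", "diamond shamrock")

# Hypothetical helper (not part of tm): return every 1- and 2-word
# n-gram of a document as a character vector.
# Change 1:2 to 2:3 for two- and three-word n-grams.
ngram_tokenizer <- function(x) {
  w <- words(x)  # whitespace-based split of the document text
  unlist(lapply(1:2, function(n) {
    vapply(ngrams(w, n), paste, character(1), collapse = " ")
  }), use.names = FALSE)
}

dtm <- DocumentTermMatrix(crude, control = list(tokenize = ngram_tokenizer,
                                                dictionary = my_words))

# check.names = FALSE keeps the space in multi-word column names
df2 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm),
                  row.names = NULL, check.names = FALSE)
head(df2)
```

The dictionary still filters the terms, so only the words and phrases in my_words end up as columns.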

You can work with sequences of tokens in quanteda if you first wrap your multi-word patterns in the phrase() function.

library("quanteda")
#> Package version: 1.5.1

data("crude", package = "tm")
data_corpus_crude <- corpus(crude)

my_words <- c("diamond", "contract prices", "diamond shamrock")

kwic(data_corpus_crude, pattern = phrase(my_words))
#>                                                               
#>    [127, 1:1]                             |     Diamond      |
#>    [127, 1:2]                             | Diamond Shamrock |
#>  [127, 12:13]        today it had cut its | contract prices  |
#>  [127, 71:71] a company spokeswoman said. |     Diamond      |
#>                                   
#>  Shamrock Corp said that effective
#>  Corp said that effective today   
#>  for crude oil by 1.50            
#>  is the latest in a

Alternatively, to make them permanently into "compounded" tokens, use tokens_compound() (shown here in a simple example).

tokens("The diamond mining company is called Diamond Shamrock.") %>%
    tokens_compound(pattern = phrase(my_words))
#> tokens from 1 document.
#> text1 :
#> [1] "The"              "diamond"          "mining"          
#> [4] "company"          "is"               "called"          
#> [7] "Diamond_Shamrock" "."

Comments:

I don't know what your larger goal is, but if you add check.names = FALSE to your data.frame call, you get "oil corporation": data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)

@RonakShah Thanks for your reply. The larger goal is to search my corpus for specific phrases. This does not seem to solve the problem: although it shows "oil corporation" instead of "oil.corporation", it still does not count either phrase. For example, in text 127, "contract prices" and "diamond shamrock" each appear at least once.

If I replace the my_words vector above with my own words, it works well. The quanteda package seems to be what I need, thanks.
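
To go from locating phrases to counting them per document, one possible continuation of the tokens_compound() approach is to build a document-feature matrix and keep only the features of interest. This is a minimal sketch, not a definitive recipe: the two toy documents and the object names (txt, phrases, phrase_counts) are invented for illustration. It assumes the default "_" concatenator and the default lowercasing in dfm().

```r
library(quanteda)

# Toy documents, made up for illustration (not from the crude corpus).
txt <- c(doc1 = "Diamond Shamrock cut its contract prices today.",
         doc2 = "The diamond mining company is called Diamond Shamrock.")

# Multi-word phrases to treat as single tokens.
phrases <- c("contract prices", "diamond shamrock")

toks <- tokens(txt, remove_punct = TRUE)
# Join each matched word sequence into one token, e.g. "Diamond_Shamrock".
toks <- tokens_compound(toks, pattern = phrase(phrases))

# dfm() lowercases features by default, so select the lowercase forms.
counts <- dfm(toks)
counts <- dfm_select(counts,
                     pattern = c("diamond", "contract_prices", "diamond_shamrock"))

phrase_counts <- as.matrix(counts)
print(phrase_counts)
```

Because a compounded phrase becomes a single token, "diamond" inside "Diamond Shamrock" is no longer counted as a separate word, which sidesteps the double counting that overlapping dictionary entries cause.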