R: How to force one corpus to be evaluated with the same terms as another corpus


r nlp


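I need one text corpus to be evaluated with the same terms as another, so that I end up with a document-term matrix that has the same terms. What I am trying to do is classify different text corpora into two groups using logistic regression, but for that both corpora need to have the same variables in the output of the function DocumentTermMatrix().

Current attempt and code explanation: I don't know how to approach this problem. For example, this gives me the first document-term matrix and its terms: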

library(tm)
data("crude")
crude_1 <- crude[1:10]
dtm_1 <- DocumentTermMatrix(crude_1)
dtm_1$dimnames$Terms
#  [1] "..."               "\"(it)"            "\"demand"          "\"for"            
#  [5] "\"growth"          "\"if"              "\"is"              "\"may" ...

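I could try to compute the same frequencies on crude_2. However, that is computationally expensive, and you may know a more practical solution to this problem.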

Question: I want to force dtm_2 to use the same terms as dtm_1, while taking the frequencies only from the crude_2 dataset. Is there a practical way to do this in R?

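Or, as a simpler example: suppose I want to know how many times zebra or girafe appears in these texts, and I want to specify those words explicitly. How do I proceed?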

Libraries used:

library(tm)

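OK, so I found a way to solve the problem using a function from the tm package. I think it works well, although any different implementation is welcome.

Solution: first we capture the terms of the first corpus in a separate variable, then we score that term list against the other document-term matrix to build a matrix of frequencies: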

data("crude")
# First dataset and term list
crude_1 <- crude[1:10]
dtm_1 <- DocumentTermMatrix(crude_1)
term_list <- dtm_1$dimnames$Terms

# Second dataset
crude_2 <- crude[11:20]
dtm_2 <- DocumentTermMatrix(crude_2)

# Start from a placeholder column so the data frame has one row per document in dtm_2
X <- data.frame(dummy_col = seq_len(nrow(dtm_2)))
for (term in term_list) {
    # Per-document count of this term in the second corpus
    # (all zeros if the term never occurs there)
    X[[term]] <- tm_term_score(dtm_2, term)
}
# Removing the placeholder column
X$dummy_col <- NULL
# X now holds, for each document in crude_2, the frequency of every term from crude_1
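
Since other implementations are welcome, here is a minimal sketch of an alternative that avoids the per-term loop. It assumes the installed version of tm supports the `dictionary` entry of the `control` list (documented under `termFreq`), which restricts the tabulation to a given character vector of terms. The name `dtm_animals` is only illustrative and not part of the original post.

```r
library(tm)

data("crude")
crude_1 <- crude[1:10]
crude_2 <- crude[11:20]

# Vocabulary taken from the first corpus
dtm_1 <- DocumentTermMatrix(crude_1)
term_list <- Terms(dtm_1)

# Tabulate the second corpus against that vocabulary only.
# Assumption: the `dictionary` control option keeps every listed term
# as a column, including terms that never occur in crude_2.
dtm_2 <- DocumentTermMatrix(crude_2, control = list(dictionary = term_list))

# Check that both matrices share the same term set
print(setequal(Terms(dtm_1), Terms(dtm_2)))

# The simpler case from the question: count an explicit list of words
dtm_animals <- DocumentTermMatrix(crude_2,
                                  control = list(dictionary = c("zebra", "girafe")))
print(as.matrix(dtm_animals))
```

If this behaves as assumed, `dtm_2` has the same terms as `dtm_1` with counts coming only from crude_2, so both matrices can feed the same logistic regression model without building the data frame column by column.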