Filtering numbers and stop words out of text in R (not for a tdm) - R, tm, tidytext


I have a text corpus:

mytextdata = read.csv(path to texts.csv)
Mystopwords=read.csv(path to mystopwords.txt)

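How can I filter this text? I need to:

1) remove all numbers

2) filter out the stop words

3) remove the brackets

I will not be using a dtm; I only need to clean the numbers and stop words out of this text data.

Sample data: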

112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715

Here, Jura is a stop word. My expected output is:

  Tablet for cleaning hydraulic system

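Since the question contains only one string, I decided to create sample data myself; I hope it is close to your actual data. As Nate suggested, the tidytext package is one way to go. Here, I first removed the numbers, the punctuation, and the brackets together with their contents. Then I split each string into words with unnest_tokens() and removed the stop words. Since you have your own stop words, you may want to create your own dictionary; I simply added "jura" in the filter() step. Finally, grouping the data by id, I combined the words back into one string per id in summarise(). Note that I used "jura" rather than "Jura", because unnest_tokens() converts uppercase letters to lowercase.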

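library(dplyr)
library(tidytext)

# Sample data modelled on the string in the question
mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

# English stop-word lexicons shipped with tidytext
data(stop_words)

mydata %>%
  # Drop bracketed content first, then digits and any remaining punctuation
  mutate(text = gsub(x = text, pattern = "\\([^)]*\\)|[0-9]+|[[:punct:]]", replacement = "")) %>%
  # One lower-cased word per row
  unnest_tokens(input = text, output = word) %>%
  # Remove the standard stop words plus the custom one
  filter(!word %in% c(stop_words$word, "jura")) %>%
  # Rebuild one string per id
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))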

#     id                              text
#  <int>                             <chr>
#1     1  tablet cleaning hydraulic system
#2     2 tablet cleaning mambojumbo system
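The answer mentions creating your own dictionary for custom stop words. A minimal sketch of that idea follows, assuming dplyr and tidytext are installed. The name my_stop_words and the "custom" lexicon label are made up for illustration and are not part of either package. The custom words are lower-cased with tolower() because unnest_tokens() lower-cases the tokens by default.

```r
library(dplyr)
library(tidytext)

mydata <- data.frame(id = 1:2,
                     text = c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
                              "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321"),
                     stringsAsFactors = FALSE)

# Hypothetical custom dictionary: the built-in stop_words plus your own words.
# In practice the custom words could come from your own stop-word file.
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = tolower(c("Jura")), lexicon = "custom")
)

result <- mydata %>%
  # Drop bracketed content, digits and remaining punctuation
  mutate(text = gsub("\\([^)]*\\)|[0-9]+|[[:punct:]]", "", text)) %>%
  # One lower-cased word per row
  unnest_tokens(output = word, input = text) %>%
  # Keep only the rows whose word is NOT in the dictionary
  anti_join(my_stop_words, by = "word") %>%
  # Rebuild one string per id
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "))

print(result)
```

Using anti_join() instead of filter() keeps the stop words in a data frame, which may be easier to maintain once the custom list grows beyond a few words.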

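There are several ways to do this. If you want to rely on base R only, you can adapt @jazzurro's answer slightly and use gsub() to find and replace the text patterns you want to remove.

I will use two regular expressions for this: the first matches the numbers and the bracketed content, while the second removes the stop words. The second regular expression has to be constructed from the stop words you want to remove. If we put it all in a function, you can easily apply it to all your strings with sapply: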

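mytextdata <- read.csv("123.csv", header = FALSE, stringsAsFactors = FALSE)

custom_filter <- function(string, stopwords = character()) {
  # First regex: drop hyphens, digits and bracketed content
  string <- gsub("[-0-9]+|\\([^)]*\\)", "", string)
  # Second regex, built from the stop words, e.g. "\\b(the|Jura)\\b";
  # skipped when no stop words are given
  if (length(stopwords) > 0) {
    new_regex <- paste0("\\b(", paste0(stopwords, collapse = "|"), ")\\b")
    string <- gsub(new_regex, "", string)
  }
  # Collapse the whitespace left behind by the removals
  trimws(gsub("\\s+", " ", string))
}

stopwords <- c("the", "Jura")
# [[1]] extracts the first column as a character vector
sapply(mytextdata[[1]], custom_filter, stopwords = stopwords, USE.NAMES = FALSE)
# For the sample string from the question:
# [1] "Tablet for cleaning hydraulic system"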
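To make the two-regex approach easy to try without a CSV file, here is a self-contained sketch in base R only. The helper name clean_text and the in-memory texts vector are assumptions for illustration; the strings are the sample data used earlier in this thread. Unlike the function above, this variant matches stop words case-insensitively, which is a design choice rather than something the question asked for.

```r
# Hypothetical helper: base R only, no packages required.
clean_text <- function(x, stopwords = character()) {
  # 1) Drop bracketed content, digits and hyphens
  x <- gsub("\\([^)]*\\)|[-0-9]+", "", x)
  # 2) Drop whole-word stop words, ignoring case (skipped when none are given)
  if (length(stopwords) > 0) {
    pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, ignore.case = TRUE)
  }
  # 3) Collapse the leftover runs of whitespace and trim the ends
  trimws(gsub("\\s+", " ", x))
}

texts <- c("112773-Tablet for cleaning the hydraulic system Jura (6 pcs.) 62715",
           "1234567-Tablet for cleaning the mambojumbo system Jura (12 pcs.) 654321")

# Apply the helper to every string
cleaned <- sapply(texts, clean_text, stopwords = c("the", "jura"), USE.NAMES = FALSE)
print(cleaned)
# [1] "Tablet for cleaning hydraulic system"  "Tablet for cleaning mambojumbo system"
```

Because gsub() is vectorised, calling clean_text(texts, c("the", "jura")) directly gives the same result; sapply() is shown only to mirror the answer. Stop words containing regex metacharacters (for example a dot) would need escaping before being pasted into the pattern.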

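Comments:

Can you provide sample data, the code you have tried so far, and your expected result?

This is a common text-mining task; reviewing similar resources may help get you going.

@jazzurro, I edited the post with sample data.

@Nate, I edited the post.

Do you need to remove whatever is inside the ()?

Yes, yes. Verbs should not be removed.

Is it possible to keep only the nouns and adjectives, if any (tablet cleaning)?

You can adapt the answer slightly to keep only nouns and adjectives, provided you have a dataset containing words and their lexical categories (e.g., noun, verb, and adjective).

@D.Joe, I think you can do that. If you want a dataset containing only nouns and adjectives and want to work with tidy data, I would check out the cleanNLP package and do part-of-speech tagging with it.

@JuliaSilge, I did not know about this package before. I will take a look. Thank you very much.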