R: Pairing the gsub function with a text file to clean a corpus


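I have a large sample of tweets that I am trying to clean up before analyzing them. The tweets are in a data frame, where each cell holds the content of one tweet (e.g., "i love san francisco" and "proud member of the air force"). However, when I analyze the text in a network visualization, there are words in each bio that should be combined. I also want to combine common two-word phrases (e.g., "new york", "san francisco", and "air force"). I have already compiled a list of the terms that need to be combined, and I used gsub to combine some of them with this line of code:

twitterdata_cleaning$bio = gsub('air force','airforce',twitterdata_cleaning$bio)

The line above turns "proud member of the air force" into "proud member of the airforce". I have been able to do this successfully with a few dozen two-word phrases.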

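However, I have hundreds of two-word phrases in the bios and I want to keep better track of them, so I moved all of the terms into two columns of an Excel file. I would like a way to apply the code above using a txt or Excel file: it should find the terms in the data frame that match the terms in the first column of the file and change them to the corresponding terms in the second column.

For example, my xlsx and txt files look like this: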

    **column1**               **column2**
    san francisco             sanfrancisco
    new york                  newyork
    las vegas                 lasvegas
    san diego                 sandiego
    new hampshire             newhampshire
    good bye                  goodbye
    air force                 airforce
    video game                videogame
    high school               school
    middle school             school
    elementary school         school

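I want to use the gsub command in a single formula that searches the data frame for all the terms in column 1 and converts them to the terms in column 2, using something like this:
twitterdata_df$tweet = gsub('textfile$column1','textfile$columnb',twitterdata_df$tweet)

to get something like this in the cells: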
i love sanfrancisco
can not wait to go to newyork
what happens in lasvegas stays there
at the beach in sandiego
can beat the autumn leave in newhampshire
so done with all the drama goodbye
proud member of the airforce
love this videogame so much
playing at the school tonight 
so sick of school
school was the best and i miss it

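Any help would be greatly appreciated.

General solution
You can pass a named vector to str_replace_all() from the stringr package to accomplish this. In my example, df has a column of old values that are to be replaced with the values in new. I assume this is what you mean by using an Excel file to keep track of them.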
library(stringr)

df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)

text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")

str_replace_all(text, setNames(df$new, df$old))
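
To make the named-vector idea concrete, here is a small sketch that prints what setNames() builds and then reproduces the same replacements with base R only. The helper name replace_all_base is hypothetical, and fixed = TRUE is an assumption that the old values are literal strings rather than regular expressions.

```r
df <- data.frame(old = c("five", "six", "seven"),
                 new = as.character(5:7),
                 stringsAsFactors = FALSE)

# A named vector: the names are the patterns, the values are the replacements
lookup <- setNames(df$new, df$old)
print(lookup)

# Hypothetical helper: apply each pattern/replacement pair in turn with gsub()
replace_all_base <- function(x, lookup) {
  for (pattern in names(lookup)) {
    # fixed = TRUE treats the pattern as a literal string (an assumption)
    x <- gsub(pattern, lookup[[pattern]], x, fixed = TRUE)
  }
  x
}

text <- c("I am a vector with numbers six and other text five",
          "another vector seven six text five")
result <- replace_all_base(text, lookup)
print(result)
```

Because the pairs are applied one after another, the order of the rows can matter when one old value is a substring of another.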


Specific example
Data
Read in the text file containing the replacements:
textfile <- read.csv(textConnection("column1,column2
san francisco,sanfrancisco
new york,newyork
las vegas,lasvegas
san diego,sandiego
new hampshire,newhampshire
good bye,goodbye
air force,airforce
video game,videogame
high school,school
middle school,school
elementary school,school"), stringsAsFactors = FALSE)
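
In practice the lookup table would probably live in a file rather than in inline text. The sketch below assumes a comma-separated file; the name compound_terms.csv is hypothetical, and a temporary file stands in for it so the example runs on its own. Reading an .xlsx file directly would need an additional package and is not shown here.

```r
# Stand-in for a real file such as "compound_terms.csv" (hypothetical name)
path <- tempfile(fileext = ".csv")
writeLines(c("column1,column2",
             "san francisco,sanfrancisco",
             "new york,newyork",
             "high school,school"), path)

# stringsAsFactors = FALSE keeps both columns as character vectors;
# strip.white = TRUE drops stray spaces around each field
textfile <- read.csv(path, stringsAsFactors = FALSE, strip.white = TRUE)

str(textfile)
```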

Replacement
twitterdata_df$tweet2 <- str_replace_all(twitterdata_df$tweet, setNames(textfile$column2, textfile$column1))
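
One thing to watch: the patterns are interpreted as regular expressions and match anywhere in the string, so "air force" would also match inside "repair forces". If that matters for your data, a sketch along the following lines anchors each term at word boundaries. It uses base R gsub() in a loop rather than str_replace_all(), and the sample tweets are made up for illustration.

```r
textfile <- data.frame(column1 = c("air force", "high school"),
                       column2 = c("airforce", "school"),
                       stringsAsFactors = FALSE)

# Made-up sample tweets; the second one should be left untouched
tweets <- c("proud member of the air force",
            "sent the repair forces home",
            "playing at the high school tonight")

for (i in seq_len(nrow(textfile))) {
  # \\b marks a word boundary, so only whole-phrase matches are replaced
  pattern <- paste0("\\b", textfile$column1[i], "\\b")
  tweets <- gsub(pattern, textfile$column2[i], tweets, perl = TRUE)
}
print(tweets)
```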

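Thanks for the help, but I figured out how to do it. I decided to use a loop that goes through my two-column table, searches for each set of terms in the first column, and replaces them with the words in the second column.
 for (i in seq_len(nrow(compoundterms))) {
            # read from and write to the same column so that the replacements accumulate
            twitterdata_df$tweet = gsub(compoundterms[i,1],compoundterms[i,2],twitterdata_df$tweet)
    }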
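
A self-contained sketch of this kind of loop on a few made-up rows, showing that each pass needs to read from the column the previous pass wrote to. The small compoundterms and twitterdata_df tables here are hypothetical stand-ins for the real data.

```r
compoundterms <- data.frame(column1 = c("san francisco", "new york", "air force"),
                            column2 = c("sanfrancisco", "newyork", "airforce"),
                            stringsAsFactors = FALSE)

twitterdata_df <- data.frame(tweet = c("i love san francisco",
                                       "new york to san francisco",
                                       "proud member of the air force"),
                             stringsAsFactors = FALSE)

for (i in seq_len(nrow(compoundterms))) {
  # Same column on both sides, so a tweet with two phrases gets both replaced
  twitterdata_df$tweet <- gsub(compoundterms[i, 1], compoundterms[i, 2],
                               twitterdata_df$tweet)
}
print(twitterdata_df$tweet)
```

If the loop read from an untouched copy of the column on every pass, only the replacement from the last row of the table would survive.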

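You are looking for the adist function or an equivalent, such as a vectorized agrep.

Your problem statement is unclear. We cannot tell exactly what you have and what you want to get.

Thanks Adam, but when I tried to run the str_replace_all function on my data frame, I got an error: Error in UseMethod("type"): no applicable method for 'type' applied to an object of class "factor"

Your data is being read in with the text converted to factors. That is why I have the stringsAsFactors = FALSE part. What function are you using to read in the data? There is usually a similar option. Alternatively, just use as.character() to convert the columns after the fact.

Adam, thanks. I got stringsAsFactors to work, but the code still does not work. I realized that I did not provide enough information, such as the fact that the terms share cells with other words, which is why the code you provided does not work. I have since updated my original post with more specific information about what I need to do. Thank you for helping me. I appreciate it.

I think I am a bit confused now. Which one does not work? The code should take an input vector and return a vector with the replacements made. By "cell", do you mean a set of old/new terms to be replaced (e.g., "new york" and "newyork"), or each element of the twitterdata_df$tweet vector?

Hey Adam, sorry for the delayed response! I figured it out. I will post what I did as an answer.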
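
The factor issue discussed in these comments can be reproduced and fixed in a few lines. This sketch assumes the lookup table was read in with its text columns converted to factors; the column names follow the textfile example, and the two rows are only illustrative.

```r
# Simulate a table whose text columns were read in as factors
textfile <- data.frame(column1 = c("new york", "las vegas"),
                       column2 = c("newyork", "lasvegas"),
                       stringsAsFactors = TRUE)
print(is.factor(textfile$column1))

# Convert every factor column to character after the fact
factor_cols <- vapply(textfile, is.factor, logical(1))
textfile[factor_cols] <- lapply(textfile[factor_cols], as.character)

str(textfile)
```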
Output of the str_replace_all() call above, with the original tweet column next to the new tweet2 column:

   id                                        tweet                                    tweet2
1   1                         i love san francisco                       i love sanfrancisco
2   2               can not wait to go to new york             can not wait to go to newyork
3   3        what happens in las vegas stays there      what happens in lasvegas stays there
4   4                    at the beach in san diego                  at the beach in sandiego
5   5   can beat the autumn leave in new hampshire can beat the autumn leave in newhampshire
6   6           so done with all the drama goodbye        so done with all the drama goodbye
7   7                proud member of the air force              proud member of the airforce
8   8                 love this video game so much               love this videogame so much
9   9           playing at the high school tonight             playing at the school tonight
10 10                     so sick of middle school                         so sick of school
11 11 elementary school was the best and i miss it         school was the best and i miss it
