Text mining using the grep function (R, Twitter, Text Mining)

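I'm having trouble scoring my data. The dataset is shown below. The text column holds the tweets I want to use for text mining and sentiment analysis.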

**text**                                         **call    bills    location**
-the bill was not generated                           0        bill       0
-tried to raise the complaint                         0         0         0 
-the location update failed                           0         0       location
-the call drop has increased in my location         call        0       location
-nobody in the location received bill,so call ASAP  call      bill      location

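This is dummy data, where text is the column I'm trying to mine. I used grepl in R to create the category columns (e.g. bill, call, location): if "bill" appears in a row, "bill" is written under the bill column, and likewise for every other category.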

vdftweet$app = ifelse(grepl('app',tolower(vdftweet$text)),'app',0)
table(vdftweet$app)
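
The single-keyword line above can be repeated for every category with a loop. The following is a minimal sketch, assuming a small made-up data frame `tweets` and a hypothetical keyword vector `keywords` (neither comes from the original post). Note that mixing a string and `0` inside `ifelse()` yields a character column, so the sketch writes `"0"` explicitly.

```r
# Hypothetical sample data; the column name `text` mirrors the question.
tweets <- data.frame(
  text = c("-the bill was not generated",
           "-the call drop has increased in my location"),
  stringsAsFactors = FALSE
)

# Assumed list of categories to flag.
keywords <- c("call", "bill", "location")

# One column per keyword: the keyword itself if found, otherwise "0".
for (kw in keywords) {
  found <- grepl(kw, tolower(tweets$text), fixed = TRUE)
  tweets[[kw]] <- ifelse(found, kw, "0")
}

print(tweets)
```

Using `fixed = TRUE` treats each keyword as a literal string rather than a regular expression, which seems the safer default when the keywords are plain words.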

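Now, the part I can't figure out is this:

I want to create a new column, "category_name", that gives for each row the names of the categories it belongs to. If a tweet has more than 3 categories, it should be labelled "Other"; otherwise the category names should be listed.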

Here is one approach that uses `apply` and checks which column names intersect with the entries in each row:

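Input data frame df1:
#>                            text bill call location app
#> 1                     blah bill bill    0        0   0
#> 2                     blah call    0 call        0   0
#> 3 the location failed to update    0    0 location   0
#> 4    bill, call, location, blah bill call location   0
#> 5            bill blah location bill    0 location   0
#> 6     bill, call, location, app bill call location app

Result after adding df1$category_name:
#>                            text bill call location app
#> 1                     blah bill bill    0        0   0
#> 2                     blah call    0 call        0   0
#> 3 the location failed to update    0    0 location   0
#> 4    bill, call, location, blah bill call location   0
#> 5            bill blah location bill    0 location   0
#> 6     bill, call, location, app bill call location app
#>          category_name
#> 1                 bill
#> 2                 call
#> 3             location
#> 4 bill, call, location
#> 5       bill, location
#> 6                Other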
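
The `apply` and intersect idea described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the answer's original code: the data frame is a made-up two-row stand-in, and the helper name `label_row` is hypothetical.

```r
# Made-up data in the same shape as the table above (not the original df1).
df1 <- data.frame(
  text     = c("blah bill", "bill, call, location, app"),
  bill     = c("bill", "bill"),
  call     = c("0", "call"),
  location = c("0", "location"),
  app      = c("0", "app"),
  stringsAsFactors = FALSE
)

# For each row, keep the column names that also occur as values in that row.
# More than 3 matching categories are collapsed to "Other".
label_row <- function(row) {
  categories <- intersect(names(row), row)
  if (length(categories) > 3) "Other" else paste(categories, collapse = ", ")
}

# apply() with margin 1 passes each row as a named character vector.
df1$category_name <- apply(df1, 1, label_row)
print(df1$category_name)
```

This relies on every flag column storing its own column name as the "present" value, which is how the question builds its columns.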

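If the column names do not correspond to the terms you are searching for, but those terms are stored in some vector, say `keys`, the same approach works: just substitute `keys` wherever `names(row)` appears in the code.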
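
A minimal sketch of that substitution, assuming a hypothetical data frame `df2` whose column names (`c1` to `c4`) are unrelated to the search terms, and an assumed `keys` vector:

```r
# Assumed search terms, stored separately from the column names.
keys <- c("bill", "call", "location", "app")

# Hypothetical data: the flag columns have uninformative names.
df2 <- data.frame(
  text = c("blah call", "bill blah location"),
  c1   = c("0", "bill"),
  c2   = c("call", "0"),
  c3   = c("0", "location"),
  c4   = c("0", "0"),
  stringsAsFactors = FALSE
)

df2$category_name <- apply(df2, 1, function(row) {
  categories <- intersect(keys, row)   # keys takes the place of names(row)
  if (length(categories) > 3) "Other" else paste(categories, collapse = ", ")
})
print(df2$category_name)
```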

Created on 2018-05-10 by the reprex package (v0.2.0).

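You can do this with the `tidyverse` packages. In the first approach, `mutate` is used to add the category names as columns to the text data.frame, similar to what you already have. `gather` is then used to convert this to key-value format, where the categories are the values in the `category_name` column.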

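An alternative approach is to go directly to the key-value format, where the categories are the values in the `category_name` column. Rows are repeated if they fall into multiple categories. If you don't need the first form, with categories as column names, the alternative approach is more flexible when adding new categories and requires less processing.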

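In both approaches, `str_match` takes the regular expression that matches a category to the text. The patterns here are simple, but more complex patterns could be used if needed.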
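
As an illustration of a slightly richer pattern, word boundaries keep "bill" from matching inside "billboard", and an optional suffix also catches "bills". The sketch below uses base R's `regexpr()` and `regmatches()` so that it runs without extra packages; the pattern and the sample strings are assumptions, not taken from the answer.

```r
# Made-up sample strings.
texts <- c("two bills arrived", "saw a billboard", "no bill yet")

# "bill" or "bills" as a whole word only.
pattern <- "\\bbills?\\b"

# regexpr() returns the match position, or -1 where nothing matched.
hits <- regexpr(pattern, texts, perl = TRUE)

# Fill in the matched text where there was a match, NA elsewhere.
matched <- rep(NA_character_, length(texts))
matched[hits > 0] <- regmatches(texts, hits)
print(matched)
```

With stringr loaded, passing the same pattern to `str_match` or `str_extract` should behave the same way for this case.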

The code follows:

library(tidyverse)
#
# read dummy data into data frame
#
   dummy_dat <- read.table(header = TRUE,stringsAsFactors = FALSE, 
                      strip.white=TRUE, sep="\n",
          text= "text
            -the bill was not generated
          -tried to raise the complaint
          -the location update failed
          -the call drop has increased in my location
          -nobody in the location received bill,so call ASAP")
#
#  form data frame with categories as columns
#
   # str_match() returns a matrix; [, 1] keeps the full match as a plain vector.
   # Referring to text (not .$text) makes the match use the lower-cased text.
   dummy_cats <- dummy_dat %>% mutate(text = tolower(text),
                               bill = str_match(text, pattern = "bill")[, 1],
                               call = str_match(text, pattern = "call")[, 1],
                               location = str_match(text, pattern = "location")[, 1],
                               other = ifelse(is.na(bill) & is.na(call) &
                                              is.na(location), "other", NA))
#
#  convert categories as columns to key-value format
#  with categories as values in category_name column
#

   dummy_cat_name <- dummy_cats %>% 
               gather(key = type, value=category_name, -text,na.rm = TRUE) %>%
               select(-type) 

#
#---------------------------------------------------------------------------
#
#  ALTERNATIVE:  go directly from text data to key-value format with categories
#  as values under category_name
#  Rows are repeated if they fall into multiple categories
#  Rows with no categories are put in category other
#
   dummy_dat <- dummy_dat %>% mutate(text = tolower(text))
   dummy_cat_name1 <- data.frame(text = character(), category_name = character(),
                                 stringsAsFactors = FALSE)
   # "category" is used as the loop variable to avoid masking base::cat()
   for (category in c("bill", "call", "location")) {
      # [, 1] turns the str_match() matrix into a plain character vector
      temp <- dummy_dat %>%
         mutate(category_name = str_match(text, pattern = category)[, 1]) %>%
         na.omit()
      dummy_cat_name1 <- dummy_cat_name1 %>% bind_rows(temp)
   }
   dummy_cat_name1 <- left_join(dummy_dat, dummy_cat_name1, by = "text") %>%
              mutate(category_name = ifelse(is.na(category_name), "other", category_name))

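Thanks, this works well, but for a large dataset I would end up creating n combinations. Is it possible to duplicate each row when it has more than one category?

@ASSOND I'm not sure I fully understand what you mean. Are you asking whether you can return two columns instead of "location, bill" (one column with "location" and another with "bill")?

No, I want a single column, category_name, that gives me bill, location and call for the corresponding row. But if several categories occur (bill, call, location), the same row should be duplicated, once per value. E.g. the text is (the voice quality in my location is very bad and I can't make a call). This tweet has two categories, call and location, so the row should be duplicated in the category_name column, with the value "call" in the first copy and "location" in the other.

Isn't that what the function I wrote does? For example, if the text contains call and location, category_name contains "call, location"? Otherwise I don't understand your expected result. You should edit your question and write out the result you want.

Oh, great! I've been looking for this. Thanks

@AAOND Glad it worked for you. You can tick the "accepted" checkbox so that others with a similar problem know this is a solution.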

 dummy_cat_name1
                                            text      category_name
                            -the bill was not generated          bill
                          -tried to raise the complaint         other
                            -the location update failed      location
            -the call drop has increased in my location          call
            -the call drop has increased in my location      location
     -nobody in the location received bill,so call asap          bill
     -nobody in the location received bill,so call asap          call
     -nobody in the location received bill,so call asap      location
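
If the question's rule (more than 3 categories becomes "Other") is also needed, a long table of this shape can be folded back into one row per tweet. This is a minimal base R sketch on made-up data: the data frame `long`, the placeholder texts `t1` to `t3`, and the helper `collapse_categories` are all hypothetical.

```r
# Long format: one row per (tweet, category) pair, as in the output above.
long <- data.frame(
  text = c("t1", "t2", "t2", "t3", "t3", "t3", "t3"),
  category_name = c("bill", "call", "location",
                    "bill", "call", "location", "app"),
  stringsAsFactors = FALSE
)

# Join the categories of one tweet, or return "Other" when there are more than 3.
collapse_categories <- function(x) {
  if (length(x) > 3) "Other" else paste(x, collapse = ", ")
}

# aggregate() applies the helper to the categories of each distinct text.
wide <- aggregate(category_name ~ text, data = long, FUN = collapse_categories)
print(wide)
```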