R: help creating a bag-of-words model


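Disclaimer: this is part of a homework assignment.

I have a set of tweets, and I need to build a classifier that tries to predict their sentiment. To do this, I am going to create a bag-of-words model and apply an SVM with a radial kernel to the data.

Here is the original data, to give you an idea: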

> original_tweets
# A tibble: 2,385 x 3
   tweet_id sentiment text                                                                                                                      
      <int> <chr>     <chr>                                                                                                                     
 1        1 positive  @TylerSkewes: It is almost 2014. Where are the self-driving cars so we don't have to worry about a DD tonight. Forreal tho
 2        2 positive  @WIRED: BMW builds a self-driving car -- that drifts I love this technology. Drive me to work baby!
 3        3 positive  Google better hurry up with that driverless car. Watching grandma do an 8 point turn to get in a parking spot is horrific.
 4        4 positive  I just waved thank you to this lady that let me merge on the highway and she gave me the finger. Need my self driving car.
 5        5 positive  I might be the only person who starts #cheering in their car when they see a @google car :) #happiness #feelslikeChristmas
 6        6 positive  I want the driverless car, and BAD. Seriously I would be happy if tomorrow morning there were no drivers behind the wheel.
 7        7 positive  I'm over here writing a 2000 word essay while *****s at Google are on driverless cars making ground breaking shit. Damn. _
 8        8 positive  Is it crazy to think that self driving cars will be the biggest innovation of the last few decades? 
 9        9 positive  Its very nice!RT @cdixon: It's awesome that Google is investing in futuristic stuff like AR glasses and self-driving cars.
10       10 positive  Look closely you will see the reflection of a google car !!!! Screen shot from google maps !!!!!
# ... with 2,375 more rows
> 
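I edited some of the terms slightly because they contained URLs, but you get the idea.

I have put the data into a tidy format and calculated the TF-IDF score for each term. For my feature space, I chose the 1,000 terms with the highest IDF scores.

Here is a sample of my data: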

> feature_space
# A tibble: 3,000 x 7
   tweet_id sentiment word                   n     tf   idf tf_idf
      <int> <chr>     <chr>              <int>  <dbl> <dbl>  <dbl>
 1        1 positive  forreal                1 0.0435  7.78  0.338
 2        2 positive  drifts                 1 0.0476  7.78  0.370
 3        2 positive  rprjtelkg6             1 0.0476  7.78  0.370
 4        5 positive  cheering               1 0.0455  7.78  0.353
 5        5 positive  feelslikechristmas     1 0.0455  7.78  0.353
 6        7 positive  2000                   1 0.0476  7.78  0.370
 7        7 positive  *****s                 1 0.0476  7.78  0.370
 8        8 positive  decades                1 0.0417  7.78  0.324
 9        8 positive  vltlymug89             1 0.0417  7.78  0.324
10        9 positive  ar                     1 0.0476  7.78  0.370
# ... with 2,990 more rows
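
As a sanity check on the numbers above, the first row can be reproduced by hand. This is only a sketch under assumptions: a natural-log IDF, idf = ln(N / df), which is the convention used by, for example, `tidytext::bind_tf_idf()`; a term that occurs in exactly one tweet; and a tweet length of 23 tokens, which is inferred from tf = 0.0435 and is not stated anywhere in the post.

```r
# Assumptions (not stated in the post): natural-log IDF, a term that appears
# in exactly one tweet, and a 23-token tweet (inferred from tf = 0.0435).
n_tweets   <- 2385   # rows in original_tweets
doc_freq   <- 1      # number of tweets containing the term (assumed)
n_in_tweet <- 1      # occurrences of the term in the tweet (column n)
tweet_len  <- 23     # tokens in the tweet (assumed)

tf     <- n_in_tweet / tweet_len
idf    <- log(n_tweets / doc_freq)
tf_idf <- tf * idf

# Round to three significant digits, as the tibble printout does
signif(c(tf = tf, idf = idf, tf_idf = tf_idf), 3)
```

Under these assumptions the values agree with the `forreal` row (tf 0.0435, idf 7.78, tf_idf 0.338). It would also explain why every row shown has the same idf: selecting the highest-IDF terms picks words that occur in a single tweet.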
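
I want to use these TF-IDF scores to create a bag-of-words model, in order to build a sentiment classifier. For this model, I know I need to set up my data frame so that each tweet is a row and each term in my feature space is a column holding that term's TF-IDF weight.

I am having a hard time figuring out how best to reshape the tibble or data frame to get the data into this format. I have tried various combinations of mutate() and join(), but it never comes out the way I want.

How can I quickly add 3,000 or more columns to a data frame or tibble based on a set of feature words, and fill this sparse data structure with their TF-IDF values? I don't necessarily need a direct code answer, but a step in the right direction on how to do this in R would help me a lot.

Update: I now have an empty tibble for my bag of words; I just need to fill it in with the non-zero TF-IDF values. Here it is: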

> bag_of_words
# A tibble: 2,385 x 3,002
   tweet_id sentiment forreal drifts rprjtelkg6 cheering feelslikechristmas `2000` *****s decades vltlymug89    ar closely reflection zg7hvvfgpn
      <int> <chr>       <dbl>  <dbl>      <dbl>    <dbl>              <dbl>  <dbl>  <dbl>   <dbl>      <dbl> <dbl>   <dbl>      <dbl>      <dbl>
 1        1 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 2        2 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 3        3 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 4        4 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 5        5 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 6        6 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 7        7 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 8        8 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
 9        9 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
10       10 positive        0      0          0        0                  0      0      0       0          0     0       0          0          0
# ... with 2,375 more rows, and 2,987 more variables

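OK, I think I have a solution:
library(dplyr)

#create bag of words model
#get tweet_id and sentiment (drop the raw text)
bag_of_words <- original_tweets %>%
  select(-text)

#get the distinct words from feature space
feature_words <- unique(feature_space$word)

#generate empty (all-zero) columns, one per feature word
for (w in feature_words) {
  bag_of_words[[w]] <- 0
}

#fill in columns with values from feature space
#(one iteration per row of feature_space, i.e. per tweet/word pair)
for (i in seq_len(nrow(feature_space))) {
  word <- feature_space$word[i]
  #look up the row by tweet_id instead of assuming tweet_id equals the row number
  row <- match(feature_space$tweet_id[i], bag_of_words$tweet_id)
  bag_of_words[[word]][row] <- feature_space$tf_idf[i]
}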
> bag_of_words
# A tibble: 2,385 x 3,002
   tweet_id sentiment forreal drifts rprjtelkg6 cheering feelslikechristmas `2000` *****s decades vltlymug89    ar closely reflection zg7hvvfgpn
      <int> <chr>       <dbl>  <dbl>      <dbl>    <dbl>              <dbl>  <dbl>  <dbl>   <dbl>      <dbl> <dbl>   <dbl>      <dbl>      <dbl>
 1        1 positive    0.338  0          0        0                  0      0      0       0          0     0       0          0          0    
 2        2 positive    0      0.370      0.370    0                  0      0      0       0          0     0       0          0          0    
 3        3 positive    0      0          0        0                  0      0      0       0          0     0       0          0          0    
 4        4 positive    0      0          0        0                  0      0      0       0          0     0       0          0          0    
 5        5 positive    0      0          0        0.353              0.353  0      0       0          0     0       0          0          0    
 6        6 positive    0      0          0        0                  0      0      0       0          0     0       0          0          0    
 7        7 positive    0      0          0        0                  0      0.370  0.370   0          0     0       0          0          0    
 8        8 positive    0      0          0        0                  0      0      0       0.324      0.324 0       0          0          0    
 9        9 positive    0      0          0        0                  0      0      0       0          0     0.370   0          0          0    
10       10 positive    0      0          0        0                  0      0      0       0          0     0       0.370      0.370      0.370
# ... with 2,375 more rows, and 2,987 more variables
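
The loops work, but the same reshaping can probably be done in one vectorised step. Below is a minimal sketch, assuming tidyr 1.1 or later (for `pivot_wider()` with a scalar `values_fill`) and that each (tweet_id, word) pair appears at most once in `feature_space`. The two small tibbles are made-up stand-ins for the real data, not values from the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-ins for original_tweets and feature_space (made-up values)
original_tweets <- tibble(
  tweet_id  = 1:3,
  sentiment = c("positive", "positive", "negative"),
  text      = c("forreal tho", "drifts baby", "no feature words here")
)
feature_space <- tibble(
  tweet_id = c(1L, 2L, 2L),
  word     = c("forreal", "drifts", "baby"),
  tf_idf   = c(0.338, 0.370, 0.370)
)

# One column per word, one row per tweet that has at least one feature word;
# word/tweet combinations that do not occur are filled with 0
wide <- feature_space %>%
  select(tweet_id, word, tf_idf) %>%
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

# Join back onto the full tweet list so tweets without any feature word
# are kept, then replace the NAs introduced by the join with 0
bag_of_words <- original_tweets %>%
  select(tweet_id, sentiment) %>%
  left_join(wide, by = "tweet_id") %>%
  mutate(across(-c(tweet_id, sentiment), ~ replace_na(.x, 0)))

bag_of_words
```

The join step matters because `pivot_wider()` only produces rows for tweets that appear in `feature_space`; without it, tweets containing none of the selected terms would silently drop out of the training data. A feature word that happens to be named `tweet_id` or `sentiment` would clash with the existing columns and would need renaming first.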