Combining nest() and aggregate() in R?

Tags: r, dplyr, rtweet

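Looking for some help and advice:

I collected tweets with the rtweet package. This gives me a data frame with observations (i.e. tweets) in rows and variables in columns. Some variables are at the tweet level (text, likes, hashtags, etc.) and some are at the account level (number of followers, bio, etc.). I ran a sentiment analysis on the tweets, which added a tweet-level sentiment score variable to the data frame.

To simulate what my data looks like now (in reality I have more than 100,000 observations and 115 variables), see the reproducible data frame df built with set.seed(1234) further down.

What I would like to do now is work at the user account level. For that, I want to aggregate the likes and sentiment scores per user by taking their mean, while at the same time combining all tweet texts of each user into one vector (a single long string would also be fine). The bio should not be merged.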

Aggregating on its own is not a problem:

df %>%
  group_by(users) %>%
  summarise(meanlikes = mean(likes),
            meansentiment = mean(sentiment))

For nesting the data, I came up with the following:

df %>%
  select(-likes, -sentiment) %>%
  nest(-users, -followers, -bio)

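I could not find a sensible way to combine the two in a single piece of code. I ran the two operations separately and used inner_join(), which seems to work fine, but this approach is very cumbersome because I have 115 variables.

d1 <- df %>%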
  select(-likes, -sentiment) %>%
  nest(-users, -followers, -bio)

d2 <- df %>%
  group_by(users)%>%
  summarise(meanlikes = mean(likes),
            meansentiment = mean(sentiment))

d1 <- d1 %>%
  inner_join(d2)

I hope you can help me here.

You can try the following approach:

# set seed to make df reproducible
set.seed(1234)

df <- data.frame(users = c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1'),
                 text = c('this is u1 first tweet', 
                          'this is another tweet', 
                          'hello hello', 
                          'hashtag tweettext',
                          'tweet text',
                          'this is u1 second tweet',
                          'this is u6 first tzeet',
                          'this is u6 second tweet',
                          'this is u6 third tweet',
                          'this is u1 third tweet'),
                 likes= sample(1:10, 10),
                 sentiment= rnorm(10, mean=0, sd=1),
                 followers = c(111, 200, 300, 400, 500, 111, 666, 666, 666, 111),
                 bio = paste0(rep('lorem ipsum', 10), " ", c('u1', 'u2', 'u3', 'u4', 'u5', 'u1', 'u6', 'u6', 'u6', 'u1')))


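library(dplyr)    # group_by(), mutate(), select(), distinct()
library(stringr)  # str_c()

# Within each user, join all tweets into one comma-separated string and
# compute the means; distinct() then drops the now-identical rows.
df %>%
  group_by(users) %>%
  mutate(tweets = str_c(text, collapse = ", "),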
         meanlikes = mean(likes),
         meansentiment = mean(sentiment)) %>%
  select(-text, -likes, -sentiment) %>%
  distinct()
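
The distinct() step above only reduces each user to a single row if every remaining column (here followers and bio) is constant within a user. The following is a minimal sketch of how that assumption could be checked first. The toy data frame and the names toy and check are illustrative and not part of the original answer.

```r
library(dplyr)

# Hypothetical toy data: u1 has two tweets, u2 has one
toy <- data.frame(users     = c("u1", "u1", "u2"),
                  followers = c(111, 111, 200),
                  bio       = c("bio u1", "bio u1", "bio u2"),
                  stringsAsFactors = FALSE)

# Count the distinct account-level values per user. Any count above 1
# means distinct() would leave more than one row for that user.
check <- toy %>%
  group_by(users) %>%
  summarise(across(c(followers, bio), n_distinct))

check
```

If every count is 1, the account-level columns can safely be carried along unchanged.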


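You can group by users and keep the first value of bio and followers, since they are the same for every tweet of a given user. Take the mean of likes and sentiment, and use toString to collapse the text into a single comma-separated string.
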
library(dplyr)

df %>%
  group_by(users) %>%
  summarise(across(c(bio, followers), first),
            across(c(likes, sentiment), mean), 
            text = toString(text))

#  users bio      followers likes sentiment text             
#  <chr> <chr>        <dbl> <dbl>     <dbl> <chr>            
#1 u1    lorem i…       111  6.67    0.0870 this is u1 first…
#2 u2    lorem i…       200  8      -0.945  this is another …
#3 u3    lorem i…       300  6       0.225  hello hello      
#4 u4    lorem i…       400  3       0.359  hashtag tweettext
#5 u5    lorem i…       500  5      -0.664  tweet text       
#6 u6    lorem i…       666  4.33    0.206  this is u6 first…
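
The question also mentions that a vector of tweets per user would be acceptable instead of one long string. The following is a minimal sketch of that variant, assuming dplyr 1.0 or later: wrapping text in list() gives a list-column holding one character vector per user. The toy data and the names toy, by_user and tweets are illustrative only.

```r
library(dplyr)

# Hypothetical toy data with the same column layout as df
toy <- data.frame(users     = c("u1", "u2", "u1"),
                  text      = c("first tweet", "another tweet", "second tweet"),
                  likes     = c(2, 8, 4),
                  sentiment = c(0.5, -1, 1.5),
                  followers = c(111, 200, 111),
                  bio       = c("bio u1", "bio u2", "bio u1"),
                  stringsAsFactors = FALSE)

by_user <- toy %>%
  group_by(users) %>%
  summarise(across(c(bio, followers), first),   # account level: keep one value
            across(c(likes, sentiment), mean),  # tweet level: average per user
            tweets = list(text))                # tweet texts: one vector per user

# Character vector with all tweets of the first user
by_user$tweets[[1]]
```

With many numeric tweet-level columns, a tidyselect helper such as where(is.numeric) could stand in for the explicit c(likes, sentiment). It would also average numeric account-level columns like followers, which is harmless as long as they are constant within a user.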

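Thank you very much! This solution seems to be the most elegant and economical one. It definitely did the job!

Output of the first approach (group_by() + mutate() + distinct()):
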
  users                                                                    text followers
1    u1 this is u1 first tweet, this is u1 second tweet, this is u1 third tweet       111
2    u2                                                   this is another tweet       200
3    u3                                                             hello hello       300
4    u4                                                       hashtag tweettext       400
5    u5                                                              tweet text       500
6    u6 this is u6 first tzeet, this is u6 second tweet, this is u6 third tweet       666
             bio meanlikes meansentiment
1 lorem ipsum u1  4.333333    -0.2846824
2 lorem ipsum u2  6.000000    -0.5443194
3 lorem ipsum u3  2.000000     1.8001123
4 lorem ipsum u4  4.000000     1.0114402
5 lorem ipsum u5  9.000000    -0.5637166
6 lorem ipsum u6  7.000000     1.2346833
