R: Calculating a percentage given a condition
Tags: r, dplyr, conditional, percentage, mutate
I'm new to this site and to coding, and I was wondering whether any of you could help me.
I need to find the top 5 movies and, from the distribution of their ratings, calculate the percentage of ratings that are 4 stars or higher for each movie.
So far I have only been able to count the number of occurrences using dplyr.
Is it possible to calculate this with dplyr, in a way similar to my code?
I'm not sure whether I need mutate to get to the solution or whether there is another way to do it.
My code so far:
# drop rows with missing values
dfAux1 <- na.omit(dfAux)
# count ratings per movie and keep the five most-rated movies
dfAux1 %>%
  group_by(movie) %>%
  summarise(tot = n()) %>%
  arrange(desc(tot)) %>%
  head(5)
This is my result so far:
# A tibble: 5 x 2
movie tot
<fctr> <int>
1 Toy Story 17
2 The Silence of the Lambs 16
3 Star Wars IV - A New Hope 15
4 Star Wars VI - Return of the Jedi 14
5 Independence Day 13
I am using data.table rather than dplyr:
library(data.table)
setDT(dfAux1) # make dfAux1 as data table by reference
# Calculate the total number of ratings per movie, then compute the share of
# `Rating >= 4` per movie, sort by `tot` in descending order, and drop duplicate
# movies with `.SD[1]`, which returns the first row for each movie.
dfAux1[, .(Rating, tot = .N), by = movie ][Rating >= 4, .(percent = .N/tot, tot), by = movie ][order(-tot), .SD[1], by = movie]
# movie percent tot
# 1: Toy Story 0.35294118 17
# 2: The Silence of the Lambs 0.43750000 16
# 3: Star Wars IV - A New Hope 0.53333333 15
# 4: Star Wars VI - Return of the Jedi 0.35714286 14
# 5: Independence Day 0.30769231 13
# 6: Gladiator 0.50000000 12
# 7: Total Recall 0.08333333 12
# 8: Groundhog Day 0.41666667 12
# 9: The Matrix 0.41666667 12
# 10: Schindler's List 0.33333333 12
# 11: The Sixth Sense 0.33333333 12
# 12: Saving Private Ryan 0.36363636 11
# 13: Pulp Fiction 0.36363636 11
# 14: Stand by Me 0.36363636 11
# 15: Shakespeare in Love 0.27272727 11
# 16: Raiders of the Lost Ark 0.27272727 11
# 17: Forrest Gump 0.30000000 10
# 18: The Shawshank Redemption 0.70000000 10
# 19: Babe 0.40000000 10
# 20: Blade Runner 0.44444444 9
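The chained call above filters on `Rating >= 4` in its second step, so a movie with no such ratings would not appear in the result at all. A minimal sketch of a single grouped pass that keeps those movies and returns only the five most-rated ones is shown below. The `toy` table and its values are made up for illustration and are not the OP's data.

```r
library(data.table)

# Hypothetical toy data standing in for dfAux1 (not the OP's data)
toy <- data.table(
  movie  = c("A", "A", "A", "A", "B", "B", "B", "C"),
  Rating = c(5, 4, 2, 1, 3, 5, 2, 1)
)

# One grouped pass: share of ratings >= 4 and number of ratings per movie.
# mean() of a logical vector is the proportion of TRUE values.
by_movie <- toy[, .(percent = mean(Rating >= 4), tot = .N), by = movie]

# Most-rated movies first; head() returns fewer rows if fewer exist
top_by_count <- head(by_movie[order(-tot)], 5)
print(top_by_count)
```

Movie C has no rating of 4 or higher, and it stays in the result with a share of 0.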
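Overview
I used the dplyr package to group the data by the movie column and perform calculations based on the Rating column.
Within summarise(), I created three new columns:
Total_Review: counts the total number of reviews for each movie
FourPlus_Rating: counts the subset of reviews with a Rating of 4 or higher
Per_FourPlus_Rating: divides FourPlus_Rating by Total_Review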
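# install necessary package
install.packages( pkgs = "dplyr" )
# load necessary package
library( dplyr )
# view first six rows
head( x = df )
#   Rating                     movie
# 1      1 Star Wars IV - A New Hope
# 2      5 Star Wars IV - A New Hope
# 5      4 Star Wars IV - A New Hope
# 6      2 Star Wars IV - A New Hope
# 8      4 Star Wars IV - A New Hope
# 9      5 Star Wars IV - A New Hope
# perform calculations using
# dplyr functions
df %>%
  group_by( movie ) %>%
  summarise( Total_Review = n()
             , FourPlus_Rating = length( Rating[ which( Rating >= 4 ) ] )
             , Per_FourPlus_Rating = length( Rating[ which( Rating >= 4 ) ] ) / n() ) %>%
  arrange( desc( Per_FourPlus_Rating ) ) %>%
  head( n = 5 )
# A tibble: 5 x 4
#   movie               Total_Review FourPlus_Rating Per_FourPlus_Rati…
#   <fct>                      <int>           <int>              <dbl>
# 1 The Shawshank Rede…           10               7              0.700
# 2 Star Wars IV - A N…           15               8              0.533
# 3 Gladiator                     12               6              0.500
# 4 Blade Runner                   9               4              0.444
# 5 The Silence of the…           16               7              0.438
# end of script #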
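The count `length( Rating[ which( Rating >= 4 ) ] )` can also be written with `sum()` and `mean()` on the logical vector `Rating >= 4`, because `TRUE` counts as 1. The sketch below shows that variant. The `toy` data frame is made up for illustration, and it assumes missing ratings have already been removed (as the OP does with `na.omit()`), since `sum()` and `mean()` return `NA` otherwise.

```r
library(dplyr)

# Hypothetical toy data; the OP's full data frame is not shown
toy <- data.frame(
  movie  = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Rating = c(5, 4, 2, 1, 3, 2, 5, 1),
  stringsAsFactors = FALSE
)

toy_summary <- toy %>%
  group_by(movie) %>%
  summarise(
    Total_Review        = n(),
    FourPlus_Rating     = sum(Rating >= 4),   # each TRUE counts as 1
    Per_FourPlus_Rating = mean(Rating >= 4)   # proportion of TRUE values
  ) %>%
  arrange(desc(Per_FourPlus_Rating))

print(toy_summary)
```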
A one-liner solution using data.table and the data in the OP could look like this:
library(data.table)
setDT(dfAux1)[, .(pct = sum(Rating>=4)/.N), by=movie][order(-pct)][1:5]
movie pct
1: The Shawshank Redemption 0.7000000
2: Star Wars IV - A New Hope 0.5333333
3: Gladiator 0.5000000
4: Blade Runner 0.4444444
5: The Silence of the Lambs 0.4375000
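For comparison, the same idea can be sketched in base R without any packages. This is only an illustrative alternative, not part of the answer above. The `toy` data frame is made up, and `head(..., 5)` is used so that fewer than five movies do not produce padding rows.

```r
# Hypothetical toy data standing in for dfAux1
toy <- data.frame(
  movie  = c("A", "A", "A", "A", "B", "B", "B", "C"),
  Rating = c(5, 4, 2, 1, 3, 5, 2, 4),
  stringsAsFactors = FALSE
)

# Share of ratings >= 4 per movie, returned as a named numeric vector
pct <- tapply(toy$Rating >= 4, toy$movie, mean)

# Sort in descending order and keep at most five movies
top_pct <- head(sort(pct, decreasing = TRUE), 5)
print(top_pct)
```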
Here is a dplyr solution:
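# number of ratings >= 4 per movie (a count, despite the column name)
dfAuxhigh <- filter(dfAux1, Rating >= 4) %>% group_by(movie) %>% summarize(percentHigh = n())
# number of all ratings per movie
dfAux <- dfAux1 %>% group_by(movie) %>% summarize(percentAll = n())
# join the two counts on movie and compute the share
result <- merge(dfAuxhigh, dfAux, by = "movie") %>% mutate(percentage = percentHigh / percentAll)
# keep the five highest shares; columns 1 and 4 are movie and percentage
result <- result[order(result$percentage, decreasing = TRUE)[1:5], c(1, 4)]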
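One thing to keep in mind with this approach: `merge()` performs an inner join by default, so a movie with no rating of 4 or higher is missing from the high-rating table and drops out of the result. A hedged sketch of a variant that keeps such movies with a share of 0 is shown below. The `toy` data and the names `high`, `all_counts`, `nHigh` and `nAll` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: movie B has no rating >= 4
toy <- data.frame(
  movie  = c("A", "A", "B", "B"),
  Rating = c(5, 2, 1, 3),
  stringsAsFactors = FALSE
)

high       <- toy %>% filter(Rating >= 4) %>% count(movie, name = "nHigh")
all_counts <- toy %>% count(movie, name = "nAll")

# left_join keeps every movie in all_counts; a missing nHigh becomes NA,
# which coalesce() turns into 0 before the division
result <- all_counts %>%
  left_join(high, by = "movie") %>%
  mutate(nHigh = coalesce(nHigh, 0L),
         percentage = nHigh / nAll) %>%
  arrange(desc(percentage))

print(result)
```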
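library(tidyverse)
df %>%
  group_by(movie, Rating) %>%
  summarise(n = n()) %>% #< get freq of movies
  mutate(freq = n/sum(n)) %>% #< find perc for each rating, by movie
  filter(Rating >=4) %>% #< filter for desired rating (4 or above)
  summarise(freq = sum(freq)) %>% #< summarize again
  top_n(5) %>%
  arrange(desc(freq)) %>%
  mutate(freq = paste0(round(freq*100, 2), "%"))
#>   movie                     freq
#> 1 The Shawshank Redemption  70%
#> 2 Star Wars IV - A New Hope 53.33%
#> 3 Gladiator                 50%
#> 4 Blade Runner              44.44%
#> 5 The Silence of the Lambs  43.75%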
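In dplyr 1.0.0 and later, `slice_max()` can take the place of a `top_n()` plus `arrange()` pair, since it selects the largest values and returns them in descending order. The following is a minimal sketch of that variant, not the answer's original code. It assumes a recent dplyr, and the `toy` data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for df
toy <- data.frame(
  movie  = c("A", "A", "A", "A", "B", "B", "B", "C"),
  Rating = c(5, 4, 2, 1, 3, 5, 2, 4),
  stringsAsFactors = FALSE
)

top_movies <- toy %>%
  group_by(movie) %>%
  summarise(freq = mean(Rating >= 4)) %>%            # share of ratings >= 4
  slice_max(freq, n = 2, with_ties = FALSE) %>%      # two highest shares
  mutate(freq_label = paste0(round(freq * 100, 2), "%"))

print(top_movies)
```

Keeping the numeric `freq` column next to the formatted label means the result can still be sorted or filtered numerically afterwards.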
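Please dput and share the data frame containing the movie details. You should use dput(dfAux1). Please also share the output of str.
That doesn't help much. I did, and it looks awful.
Please see my one-liner solution using data.table.
That's not quite what I'm trying to do. For example, Gladiator has 12 reviews, and of those 12, 6 are rated 4 or 5, so the number I'm looking for is 50%.
This works! Thanks, I just need to sort by the top 5.
I think this should be what you did (at least it's what I want); the only problem is that I need to count the occurrences rather than sum them. For example, Gladiator has 12 reviews, 6 of them rated >= 4, so it should get 0.5.
Oh, you fixed it, thanks! It works. I couldn't figure out the length part... I really appreciate your help.
Thanks for sharing the desired output! I added some explanation above. Hope this helps!
Thanks for the explanation. Perhaps you could remove the definition of df, which is already included in the OP. That would make your post very clear and easy to follow.
Thanks @MKR! I didn't realize how ugly it looked, haha. It's much cleaner now!
Thank you very much. I didn't know you could do it with just one line of code. I don't know whether to be happy or angry, hehe.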