R: Calculating the occurrence rate of words from a list in a data frame


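I have a data frame with the columns Category and pd. I need to count how many times each meaningful word found across all pd values appears in each Category. I am stuck on the last step, the summarization. Ideally, the ratio of that frequency to the total length of pd by Category would be another set of columns.

For example: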
freq = structure(list(Category = c("C1", "C2"),
                      pd = c("96 oz, epsom salt 96 oz, epsom bath salt",
                             "17 x 24 in, bath mat")),
                 .Names = c("Category", "pd"),
                 row.names = c(NA, -2L), class = "data.frame")

pool = sort(unique(gsub("[[:punct:]]|[0-9]","", unlist(strsplit(freq[,2]," ")))))
pool = pool[nchar(pool)>1]

freq:
  Category                                       pd
1       C1 96 oz, epsom salt 96 oz, epsom bath salt
2       C2                     17 x 24 in, bath mat

pool:
[1] "bath"  "epsom" "in"    "mat"   "oz"    "salt" 

Expected output:
pool  C1freq C1ratio C2freq C2ratio
bath  1      1/7     1      1/3
epsom 2      2/7     0      0
in    0      0       1      1/3
mat   0      0       1      1/3
oz    2      2/7     0      0
salt  2      2/7     0      0

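Here, for example, 7 is the number of words in the pd value for C1 after digits and punctuation have been removed (following the same rules used to build pool). The 1/7 form itself is of course not required; it is written that way here only to show the denominator.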

If possible, without dplyr or qdap. Thanks.
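The denominators mentioned in the question (7 for C1 and 3 for C2) can be checked with a short base R sketch. It assumes the same cleaning rules as pool (strip punctuation and digits, split on whitespace, drop tokens of one character or fewer); the names cleaned, tokens and denominators are illustrative and do not come from the original post.

```r
# Sample data from the question
freq <- data.frame(
  Category = c("C1", "C2"),
  pd = c("96 oz, epsom salt 96 oz, epsom bath salt", "17 x 24 in, bath mat"),
  stringsAsFactors = FALSE
)

# Remove punctuation and digits, as in the construction of pool
cleaned <- gsub("[[:punct:]]|[0-9]", "", freq$pd)

# Split on whitespace and keep only tokens longer than one character
tokens <- lapply(strsplit(cleaned, "\\s+"), function(x) x[nchar(x) > 1])

# Number of remaining words per category (hypothetical helper name)
denominators <- setNames(lengths(tokens), freq$Category)
denominators
```

Each ratio in the expected output is a word count divided by the matching entry of denominators.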

We can try

library(qdapTools)
library(stringr)
lst <- str_extract_all(freq$pd, '[A-Za-z]{2,}')
m1 <- t(mtabulate(lst))
m2 <-  prop.table(m1,2)
cbind(m1, m2)[,c(1,3,2,4)]
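Since the question asks to avoid qdap if possible, the same idea can be sketched in base R only. This is a minimal sketch rather than a drop-in replacement: it assumes the same two-or-more-letter pattern as above, and the names words, vocab, counts and ratios are illustrative. Because the list of words is named after Category, the resulting columns carry the category names.

```r
# Sample data from the question
freq <- data.frame(
  Category = c("C1", "C2"),
  pd = c("96 oz, epsom salt 96 oz, epsom bath salt", "17 x 24 in, bath mat"),
  stringsAsFactors = FALSE
)

# Extract runs of two or more letters from each pd string (base R only)
words <- regmatches(freq$pd, gregexpr("[A-Za-z]{2,}", freq$pd))
names(words) <- freq$Category

# Shared vocabulary, so every category is counted over the same words
vocab <- sort(unique(unlist(words)))

# One column of counts per category; columns are named after Category
counts <- vapply(
  words,
  function(w) tabulate(match(w, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(counts) <- vocab

# Column-wise proportions: each column sums to 1
ratios <- prop.table(counts, 2)

cbind(counts, ratios)[, c(1, 3, 2, 4)]
```

Counting against a shared vocabulary keeps zero counts in the result, which is what makes the per-category columns line up row by row.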

Thanks a lot. I'm not sure what I'm doing wrong, but I don't get fractions.

@AlexeyFerapontov, you should be able to run this directly on the sample data. Did you do the cbind step at the end?

I did. I get 5 columns: columns 2 and 4, and 3 and 5, are identical and hold counts. In your example, 3 and 5 are fractions. I copy/pasted the code.

@AlexeyFerapontov, maybe something in your working environment is interfering? Please see (link not preserved) for a verification with this sample data.

Not sure. I started a new session. Still the same.

Thank you @akrun. Is there a simple way to keep the original names from freq instead of the generic V1?

You could consider adapting your current approach as follows:

tab <- table(
  stack(
    setNames(
      lapply(strsplit(gsub("[[:punct:]]|[0-9]", "", freq$pd), "\\s+"), 
             function(x) x[nchar(x) > 1]), freq$Category)))

cbind(tab, prop.table(tab, 2))
#       C1 C2        C1        C2
# bath   1  1 0.1428571 0.3333333
# epsom  2  0 0.2857143 0.0000000
# in     0  1 0.0000000 0.3333333
# mat    0  1 0.0000000 0.3333333
# oz     2  0 0.2857143 0.0000000
# salt   2  0 0.2857143 0.0000000
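To get closer to the layout requested in the question (pool, C1freq, C1ratio, C2freq, C2ratio), the counts and proportions from tab can be interleaved in base R. The following is a sketch under the assumption that one freq/ratio column pair per category is wanted; the names counts, ratios, idx and result are illustrative and not part of the original answer.

```r
# Sample data from the question
freq <- data.frame(
  Category = c("C1", "C2"),
  pd = c("96 oz, epsom salt 96 oz, epsom bath salt", "17 x 24 in, bath mat"),
  stringsAsFactors = FALSE
)

# Word-by-category contingency table, built as in the answer above
tab <- table(
  stack(
    setNames(
      lapply(strsplit(gsub("[[:punct:]]|[0-9]", "", freq$pd), "\\s+"),
             function(x) x[nchar(x) > 1]),
      freq$Category)))

# Counts and column-wise proportions as data frames (words in rows)
counts <- as.data.frame.matrix(tab)
ratios <- as.data.frame.matrix(prop.table(tab, 2))
names(counts) <- paste0(names(counts), "freq")
names(ratios) <- paste0(names(ratios), "ratio")

# Interleave the columns: C1freq, C1ratio, C2freq, C2ratio
idx <- as.vector(rbind(seq_along(counts), seq_along(ratios) + ncol(counts)))
result <- data.frame(pool = rownames(counts),
                     cbind(counts, ratios)[, idx],
                     stringsAsFactors = FALSE)
rownames(result) <- NULL

result
```

The ratios come out as decimals. If a display such as 1/7 is preferred, MASS::fractions(prop.table(tab, 2)) is one option, at the cost of loading the MASS package.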