Can R automatically identify and count the occurrences of a word across n columns?

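This is a loaded question, but I'll do my best to explain. I'm trying to write a program that tracks how many times an insect visits a flower over a period of time. For this, I have a dataset that looks something like this: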

ID          Visit_Freq   Visitor_1   Visitor_2   Visitor_3   Visitor_4   Visitor_5
1             1.0000000  Halictidae       <NA>       <NA>       <NA>       <NA>
2             5.0000000  Syrphidae Halictidae  Syrphidae  Syrphidae       Apis
3             1.0000000        Apis       <NA>       <NA>       <NA>       <NA>
4             0.0000000        <NA>       <NA>       <NA>       <NA>       <NA>
5             0.0000000        <NA>       <NA>       <NA>       <NA>       <NA>
6             0.0000000        <NA>       <NA>       <NA>       <NA>       <NA>
7             0.0000000        <NA>       <NA>       <NA>       <NA>       <NA>
8             2.0000000        Apis       Apis       <NA>       <NA>       <NA>
9             0.0000000        <NA>       <NA>       <NA>       <NA>       <NA>
10            0.0000000        <NA>       <NA>       <NA>       <NA>       <NA>

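In the "Visitor_n" columns, I record which insect visited that flower, or NA when there was no visitor. To analyze our data, we have to count the occurrences of each insect across the Visitor columns. We can have as many as 10 visitors to a single flower (ID), and our ID count usually exceeds 500, so counting the occurrences by hand can be a chore. Here is what I've done to make it simpler: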

Apis <- sum(apply(DataSet[3:7], 2, function(x) length(which(x == 'Apis'))))

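This solution assumes that insect names contain only English letters and no digits, with the first letter uppercase and the remaining letters lowercase: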
data.frame(table(grep("[A-Z]{1}[a-z]+",stack(df1)[,1],value=TRUE)))
        Var1 Freq
1       Apis    4
2 Halictidae    2
3  Syrphidae    3
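
If the empty cells are stored as real NA values rather than as text, the regular expression may not be needed at all, because table() ignores NA by default. The following is only a sketch under that assumption: the data frame name `visits` and its contents are a hypothetical reconstruction of the sample shown in the question, not the asker's actual data.

```r
# Hypothetical reconstruction of the question's sample data,
# with real NA values instead of the string "<NA>".
visits <- data.frame(
  ID = 1:10,
  Visit_Freq = c(1, 5, 1, 0, 0, 0, 0, 2, 0, 0),
  Visitor_1 = c("Halictidae", "Syrphidae", "Apis", NA, NA, NA, NA, "Apis", NA, NA),
  Visitor_2 = c(NA, "Halictidae", NA, NA, NA, NA, NA, "Apis", NA, NA),
  Visitor_3 = c(NA, "Syrphidae", rep(NA, 8)),
  Visitor_4 = c(NA, "Syrphidae", rep(NA, 8)),
  Visitor_5 = c(NA, "Apis", rep(NA, 8)),
  stringsAsFactors = FALSE
)

# Select the visitor columns by name, so the code still works
# when there are more (or fewer) than five of them.
visitor_cols <- grep("^Visitor_", names(visits), value = TRUE)

# Flatten all visitor columns into one vector and tabulate it;
# table() drops NA values unless useNA is set.
counts <- table(unlist(visits[visitor_cols]))
counts
```

Selecting the columns by name pattern instead of by position avoids hard-coding 3:7 when the number of visitor columns changes.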

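Data: df1

Just create a vector with our insect names, then count each one in a loop:

insects <- c( "Apis", "Halictidae", "Syrphidae" )

count <- NULL   # start empty so the loop can append to it
for( i in insects ) 
    count <- c( count, sum( apply( DataSet[ 3:7 ], 2, 
                       function( x ) length( which( x == i) ) ) ) )
count
[1] 4 2 3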

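For this type of problem I like dplyr, because once the data are in the correct (tidy) format, the problem can be solved in a single line. To convert the data into tidy format, we need one more line of code (using gather() from the package tidyr).

I'm using the data frame defined by user227710. Note that it contains the string "<NA>" rather than proper R NAs, so the line that filters out the NAs looks a little odd.

The actual work is done by the functions group_by() and tally(). You tell R how the data should be grouped (here by the Species variable), and tally() then counts the rows in each group.

I know you'd rather not use external packages, but learning how to use tidyr and dplyr is definitely worth it for anyone who regularly wrangles data.
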
require(tidyr) # for gather()
require(dplyr) # for group_by() and tally()

# convert table into tidy (long) format
df_long <- gather(df1, Visitor, Species, Visitor_1:Visitor_5)
head(df_long)
##   ID Visit_Freq   Visitor    Species
## 1  1          1 Visitor_1 Halictidae
## 2  2          5 Visitor_1  Syrphidae
## 3  3          1 Visitor_1       Apis
## 4  4          0 Visitor_1       <NA>
## 5  5          0 Visitor_1       <NA>
## 6  6          0 Visitor_1       <NA>

# now count species, excluding the <NA> value
group_by(df_long, Species) %>%
    filter(Species != "<NA>") %>% 
    tally()
## Source: local data frame [3 x 2]
## 
##      Species  n
## 2       Apis  4
## 3 Halictidae  2
## 4  Syrphidae  3
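
In current tidyr releases gather() is superseded by pivot_longer(), and dplyr's count() combines group_by() and tally(). The following is a minimal sketch of the same steps with those functions, assuming both packages are installed; the data frame below is a hypothetical reconstruction of the sample data (with "<NA>" as a string, as in the answer above), and the name `species_counts` is introduced only for illustration.

```r
library(tidyr)  # for pivot_longer()
library(dplyr)  # for filter() and count()

# Hypothetical reconstruction of the sample data; missing cells are the
# literal string "<NA>", as in the data frame used in the answer above.
na <- "<NA>"
df1 <- data.frame(
  ID = 1:10,
  Visit_Freq = c(1, 5, 1, 0, 0, 0, 0, 2, 0, 0),
  Visitor_1 = c("Halictidae", "Syrphidae", "Apis", na, na, na, na, "Apis", na, na),
  Visitor_2 = c(na, "Halictidae", rep(na, 5), "Apis", na, na),
  Visitor_3 = c(na, "Syrphidae", rep(na, 8)),
  Visitor_4 = c(na, "Syrphidae", rep(na, 8)),
  Visitor_5 = c(na, "Apis", rep(na, 8)),
  stringsAsFactors = FALSE
)

species_counts <- df1 %>%
  # one row per (ID, visitor slot) instead of one column per slot
  pivot_longer(starts_with("Visitor_"),
               names_to = "Visitor", values_to = "Species") %>%
  # drop the placeholder strings that stand in for missing visitors
  filter(Species != "<NA>") %>%
  # count rows per species; the result has columns Species and n
  count(Species)

species_counts
```

If the missing cells were real NA values, the filter step could be replaced by passing values_drop_na = TRUE to pivot_longer().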

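table(unlist(DataSet[, grep('Visitor', names(DataSet))]))?

The fundamental problem you have is that your data are not tidy. You should have only 4 columns: ID, visit frequency, visitor number, and species. In that format, what you want to do is easy in dplyr with functions such as summarize or tally. If you can post code that generates a data frame of the type you're working with, I'd be happy to show you how to rearrange it into tidy format and then summarize it.

@ClausWilke How is this not "tidy"? This is what we in the industry usually call "wide" format. There is "wide" format and there is "long" format. You seem to be implying that "long" format is "tidy" and therefore "correct". Maybe if one's skill set exists only within the Hadleyverse, then yes, long format will be better, but let's not suggest that people limit themselves that way, shall we?

@rawr Here, "tidy" is a technical term, as defined in Wickham's paper. It has nothing to do with whether the format is correct. Obviously, long and wide tables contain the same information, so they are equally correct. In many cases, however, long tables are much simpler to analyze, simply because we have better tools for working with long tables than with wide ones. Also, in this particular case, where the number of visitors is not known in advance, a wide table seems a particularly poor choice, since it leads to a large number of NAs.

@ClausWilke -- I certainly wouldn't mind. I'm in the middle of nowhere right now, with almost no internet access much of the time, so it may be a bit difficult for me to get back to everyone. I appreciate everyone's answers; they have all been very insightful.
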
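# Or collect the insect names from the data itself; sort() gives the
# alphabetical order used in the output below
insects <- sort( unique( unlist( DataSet[ 3:7 ] ) ) )
# Drop missing entries (real NA or the string "<NA>"). Logical indexing is
# safe even when nothing matches, unlike x[ -which( ... ) ], which would
# return an empty vector in that case.
insects <- insects[ !is.na( insects ) & insects != "<NA>" ]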

count <- NULL

for( i in insects ) 
    count <- c( count, sum( apply( DataSet[ 3:7 ], 2, 
                       function( x ) length( which( x == i) ) ) ) )
count
[1] 4 2 3

insectCount <- data.frame( insects, count )
insectCount
     insects count
1       Apis     4
2 Halictidae     2
3  Syrphidae     3
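
The loop above can also be wrapped in a small reusable function. This is only a sketch: the function name `count_visitors`, its arguments, and the reconstructed sample data are hypothetical and introduced for illustration; it assumes missing cells are either real NA or the string "<NA>".

```r
# Count how often each distinct value occurs across the given columns.
# `data` is a data frame; `cols` is a vector of column names or positions.
count_visitors <- function(data, cols) {
  values <- unlist(data[cols], use.names = FALSE)
  # keep only real entries: drop NA and the placeholder string "<NA>"
  values <- values[!is.na(values) & values != "<NA>"]
  insects <- sort(unique(values))
  # one count per insect name; vapply() guarantees an integer result
  count <- vapply(insects, function(i) sum(values == i), integer(1))
  data.frame(insects = insects, count = unname(count),
             stringsAsFactors = FALSE)
}

# Hypothetical reconstruction of the question's sample data.
na <- "<NA>"
DataSet <- data.frame(
  ID = 1:10,
  Visit_Freq = c(1, 5, 1, 0, 0, 0, 0, 2, 0, 0),
  Visitor_1 = c("Halictidae", "Syrphidae", "Apis", na, na, na, na, "Apis", na, na),
  Visitor_2 = c(na, "Halictidae", rep(na, 5), "Apis", na, na),
  Visitor_3 = c(na, "Syrphidae", rep(na, 8)),
  Visitor_4 = c(na, "Syrphidae", rep(na, 8)),
  Visitor_5 = c(na, "Apis", rep(na, 8)),
  stringsAsFactors = FALSE
)

insectCount <- count_visitors(DataSet, 3:7)
insectCount
```

Flattening the columns once and comparing against that single vector avoids calling apply() over the data frame for every insect name.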
