R text mining: finding the frequency of character patterns

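I am trying to find the frequency of character patterns (parts of words) in a large data set.

For example, I have the following list in a csv file: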

  • AppleStrawBerryLime
  • AppleGrapeLime
  • PineappleMangoGuava
  • KiwiGuava
  • GrapeApple
  • MixedBerry
  • KiwiGuavaPineapple
  • LimeMixedBerry
Is there a way to find the frequency of all character combinations? For example:

  • appleberry
  • guava
  • applestrawberry
  • kiwiguava
  • grapeapple
  • grape
  • app
  • ap
  • wig
  • mem
  • go
Update: this is how I find the frequency of all character patterns of length 3 in my data:

# all 26^3 three-letter combinations of the lowercase letters a-z
threecombo <- do.call(paste0, expand.grid(rep(list(letters), 3)))

# for each combination, the number of elements of myData that contain it
threecompare <- sapply(threecombo, function(x) length(grep(x, myData)))

Your original question is a simple task for grep/grepl, and I see that you have already incorporated this part of my answer into your revised question.

docs <- c('applestrawberrylime', 'applegrapelime', 'pineapplemangoguava',
          'kiwiguava', 'grapeapple', 'mixedberry', 'kiwiguavapineapple',
          'limemixedberry')

patterns <-  c('appleberry', 'guava', 'applestrawberry', 'kiwiguava', 
               'grapeapple', 'grape', 'app', 'ap', 'wig', 'mem', 'go')

# how often does each pattern occur in the set of docs?
sapply(patterns, function(x) sum(grepl(x, docs)))
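
Note that the call above counts the number of documents containing each pattern, not the total number of matches. If total occurrences are what you need, the following is a minimal sketch using base R's gregexpr. The helper name count_matches and the shortened pattern list are only for illustration, and fixed = TRUE assumes the patterns are literal strings rather than regular expressions.

```r
docs <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
          "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
          "limemixedberry")
patterns <- c("guava", "grape", "app", "ap", "wig", "mem", "go")

# number of documents that contain each pattern at least once
doc_freq <- sapply(patterns, function(p) sum(grepl(p, docs, fixed = TRUE)))

# total number of non-overlapping matches across all documents
# (count_matches is a hypothetical helper, not part of any package)
count_matches <- function(p, x) {
  m <- gregexpr(p, x, fixed = TRUE)   # one vector of match positions per document
  sum(vapply(m, function(pos) sum(pos > 0), integer(1)))  # -1 marks "no match"
}
total_freq <- sapply(patterns, count_matches, x = docs)

data.frame(pattern = patterns, doc_freq = doc_freq, total_freq = total_freq,
           row.names = NULL)
```

The two counts differ whenever a pattern occurs more than once in the same string, as "ap" does in "grapeapple".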

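This works and runs quickly, but as the collection of documents grows you may get bogged down (even in this simple example, there are 625 unique patterns). You could use parallel processing for all the sapply/lapply calls, but still…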
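
One way to keep the number of patterns manageable, when the patterns are not known in advance, is to tabulate only the substrings that actually occur in the data instead of testing every conceivable combination. This is a different technique from the grepl loop above, sketched here in base R; the function names char_ngrams and ngram_doc_freq are made up for this example.

```r
docs <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
          "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
          "limemixedberry")

# all substrings of length n that occur in a single string
char_ngrams <- function(x, n) {
  len <- nchar(x)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1)
  substring(x, starts, starts + n - 1)
}

# document frequency of every observed n-gram:
# each n-gram is counted at most once per document
ngram_doc_freq <- function(docs, n) {
  per_doc <- lapply(docs, function(d) unique(char_ngrams(d, n)))
  sort(table(unlist(per_doc)), decreasing = TRUE)
}

# the most common three-character patterns
head(ngram_doc_freq(docs, 3), 10)
```

Calling ngram_doc_freq for several values of n (for example 2 to 6) gives the most frequent patterns at each length without building the full 26^n grid.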

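Because you are probably looking for combinations of fruit flavors in a set of texts that also contain non-fruit words, I made up some documents similar to those in your example. I used the quanteda package to build a document-feature matrix, and then filtered it to keep the ngrams that contain fruit words.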

docs <- c("One flavor is apple strawberry lime.", 
          "Another flavor is apple grape lime.", 
          "Pineapple mango guava is our newest flavor.",
          "There is also kiwi guava and grape apple.", 
          "Mixed berry was introduced last year.", 
          "Did you like kiwi guava pineapple?",
          "Try the lime mixed berry.")
flavorwords <- c("apple", "guava", "berry", "kiwi", "guava", "grape")

require(quanteda)
# form a document-feature matrix ignoring common stopwords plus "like" and "also"
# for unigrams, bigrams, trigrams
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english")))
## Creating a dfm from a character vector ...
##    ... lowercasing
##    ... tokenizing
##    ... indexing documents: 7 documents
##    ... indexing features: 90 feature types
##    ... removed 47 features, from 176 supplied (glob) feature types
##    ... complete. 
##    ... created a 7 x 40 sparse dfm
## Elapsed time: 0.01 seconds.
# select only those features containing flavorwords as regular expression
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex")
## kept 22 features, from 5 supplied (regex) feature types
# show the features
topfeatures(fruitDfm, nfeature(fruitDfm))
##                apple                 guava                 grape             pineapple                  kiwi 
##                    3                     3                     2                     2                     2 
##           kiwi_guava                 berry           mixed_berry            strawberry      apple_strawberry 
##                    2                     2                     2                     1                     1 
##      strawberry_lime apple_strawberry_lime           apple_grape            grape_lime      apple_grape_lime 
##                    1                     1                     1                     1                     1 
##      pineapple_mango           mango_guava pineapple_mango_guava           grape_apple       guava_pineapple 
##                    1                     1                     1                     1                     1 
## kiwi_guava_pineapple      lime_mixed_berry 
##                    1                     1 

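Welcome to Stack Overflow! Your question is interesting but hard to answer. Really, this site works better when there is a definite question. In your case, you might want to provide a link to a word corpus, then show some code you have tried and the problem you ran into with that code. Take a look at some tips!

Thanks, I have updated my question with my code. I like your answer, but how would it work if the items stored in docs are not separated by spaces?

Not sure what you are asking, but I have tried to cover both possibilities in the updated answer.

This is close, but I am looking for something that can help me when I don't know all the possible patterns/combinations.

What have you tried? Think about what kind of "pattern" input you could supply to get the desired result. Also, given a large text corpus, "all possible character combinations" would be a very large set.

I just updated my question with what I have so far. I realize the data set is huge, but I don't know what all the possible patterns are. I am trying to discover what the most common patterns are.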
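
# if the flavor names are written without spaces, list the concatenated forms to look for
flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
                       "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
                       "limemixedberry")

# build the ngrams with an empty concatenator so that "kiwi guava" becomes "kiwiguava"
fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "")
# keep only the features that match one of the concatenated flavor names
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat]
fruitDfm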
# Document-feature matrix of: 7 documents, 8 features.
# 7 x 8 sparse Matrix of class "dfmSparse"
#        features
# docs  applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry
# text1                   1              0                   0         0          0          0                  0              0
# text2                   0              1                   0         0          0          0                  0              0
# text3                   0              0                   1         0          0          0                  0              0
# text4                   0              0                   0         1          1          0                  0              0
# text5                   0              0                   0         0          0          1                  0              0
# text6                   0              0                   0         1          0          0                  1              0
# text7                   0              0                   0         0          0          1                  0              1

# generate every ordering of the single flavor words, pasted together without spaces
unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = "")))
## [1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple"
## [4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape"
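
Full permutations of all five words are unlikely to appear in real data, so shorter combinations may be more useful as search patterns. The following is a rough sketch in base R that builds every ordered pair of two different flavor words and counts how many of the concatenated strings contain it. The names docsConcat, pairs_df, pair_patterns and pair_freq are invented for this example.

```r
docsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
                "kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
                "limemixedberry")
unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")

# every ordered pair of two different flavor words, pasted together without a separator
pairs_df <- expand.grid(first = unigramFlavorWords, second = unigramFlavorWords,
                        stringsAsFactors = FALSE)
pairs_df <- pairs_df[pairs_df$first != pairs_df$second, ]
pair_patterns <- paste0(pairs_df$first, pairs_df$second)

# number of concatenated strings that contain each pair (literal matching)
pair_freq <- sapply(pair_patterns,
                    function(p) sum(grepl(p, docsConcat, fixed = TRUE)))

# keep only the pairs that occur at least once
pair_freq[pair_freq > 0]
```

Because matching is done on raw substrings, a pair starting with "apple" would also match a string containing "pineapple" followed by the same second word, so the counts should be read with that ambiguity in mind.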