R: How to create clusters based on row strings (tags: r, nlp, n-gram, lemmatization)


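I am trying to create clusters from my data based on the string value of each row. I am using R. By "cluster" I mean a broad theme (a family) that can characterize each keyword. I imagine something generated automatically from the keywords, perhaps using lemmatization or n-grams.

For example, the keywords "cloud service" and "cloud services" should both be in the "service" cluster.

This is my input vector: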

keywords_df <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service", 
        "free cloud storage", "what is cloud computing", "best cloud storage","cloud computing definition", 
        "amazon cloud services", "cloud service providers", "cloud services", "google cloud computing", "cloud computing services", "benefits of cloud computing")

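The goal is to clean the data in the "Keyword" column and automatically extract a kind of lemma or n-gram.

Here is what I have done so far:

Create a "Thematic" column based on the Keyword column: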

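library(dplyr)
# mutate() needs a data frame, so wrap the keyword vector in a Keyword column first
keywords_df <- data.frame(Keyword = keywords_df, stringsAsFactors = FALSE)
keywords_df <- mutate(keywords_df, Thematic = as.character(Keyword))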

You can use grepl() to check for the presence of certain words such as storage, computing and service. This way, you can check whether a given word is present in df:
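# df is assumed to be the character vector of keywords from the question
fams   <- c("storage", "computing", "service")
family <- rep("empty_fam", length(df))

# Later families overwrite earlier ones when a keyword matches several
for (fam in fams) {
  family[grepl(fam, df)] <- fam
}
cbind(Keywords = df, family)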
#      Keywords                      family     
# [1,] "cloud storage"               "storage"  
# [2,] "cloud computing"             "computing"
---
#[13,] "cloud computing services"    "service"  
#[14,] "benefits of cloud computing" "computing"
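
The loop above keeps only the last matching family, so "cloud computing services" ends up under "service" alone. Since the question mentions that a keyword may belong to several themes, here is a minimal base-R sketch that keeps every match as a comma-separated label. The vector name `keywords` and its five sample values are assumptions chosen for illustration.

```r
# Illustrative sample; replace with the full keyword vector
keywords <- c("cloud storage", "cloud computing services",
              "benefits of cloud computing", "amazon cloud services",
              "google cloud")
fams <- c("storage", "computing", "service")

# One logical column per family: TRUE where the family name occurs in the keyword
hits <- sapply(fams, function(fam) grepl(fam, keywords, fixed = TRUE))

# Collapse the matching family names of each row into one comma-separated label
family <- apply(hits, 1, function(row) {
  if (any(row)) paste(fams[row], collapse = ", ") else "empty_fam"
})

data.frame(keyword = keywords, family = family)
```

With a single keyword, sapply() returns a plain vector instead of a matrix, so this sketch assumes at least two keywords.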


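Edit 2: I saw your recent edit indicating that you are looking for family descriptions that are not specified in advance. In that case, the first approach that comes to mind is Latent Dirichlet Allocation (LDA, not to be confused with Linear Discriminant Analysis).
LDA analyzes a corpus of documents, identifies latent topics as distributions over words (terms(lda.output) in the code below), and identifies which document belongs to which topic (topics(lda.output) in the code below).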
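Comments:

Do you have a list of "thematic families" to group the keywords by, or should the families be assumed to come purely from the keywords vector (supervised or unsupervised)?

Thanks for your reply. I don't have a list of "thematic families". The families should be created from the "keywords" vector, based on lemmatization, n-grams or other NLP techniques. NB 1: a keyword can generate several themes (separated by commas). NB 2: the script should treat plurals, genders, etc. as one theme (e.g. "small usb key", "what is the smallest usb key?" and "smallest usb key" should all give the theme "small").

@Remi: I edited my old answer to cover your updated/clarified question.

This doesn't seem to be what the OP wants. I think the OP needs some kind of unsupervised clustering method that also assigns the most frequent word to each cluster. fams should not be given but derived from the keywords themselves.

Wow, I see the edit now. I was in a completely different ballpark.

@useR I added an unsupervised "clustering" method to my answer, in case you are interested.

I remember learning this in grad school. Thanks for the refresher!

Thanks, this is exactly what I was looking for!
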
# Remove English stop words and the main word "cloud" from the Thematic column
library(tm)
stopwords_list <- c("cloud")  # the main word, shared by every keyword
all_stopwords  <- append(stopwords(kind = "en"), stopwords_list)
keywords_df$Thematic <- removeWords(keywords_df$Thematic, all_stopwords)

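# Alternative with stringr: keep the first family name that appears in each
# keyword (NA if none matches)
library(stringr)
family <- str_extract(df, pattern = "storage|computing|service")
cbind(df, family)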

library(topicmodels)
library(tm)

# Preliminary textmining
corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, removeWords, "cloud")
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)

# Term Frequency matrix
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))

lda.output <- LDA(TF, k=3)
terms(lda.output)
# Topic 1  Topic 2  Topic 3 
# "servic" "comput" "storag"

cbind(df, terms(lda.output)[topics(lda.output)])
#            df                                    
#Topic 3 "cloud storage"               "storag"
#Topic 2 "cloud computing"             "comput"
#Topic 3 "google cloud storage"        "storag"
#Topic 1 "cloud services"              "servic"
#Topic 3 "free cloud storage"          "storag"
#Topic 2 "what is cloud computing"     "comput"
#Topic 3 "best cloud storage"          "storag"
#Topic 1 "cloud computing definition"  "servic"
#Topic 1 "amazon cloud services"       "servic"
#Topic 3 "cloud service providers"     "storag"
#Topic 2 "google cloud services"       "comput"
#Topic 2 "google cloud computing"      "comput"
#Topic 1 "cloud computing services"    "servic"
#Topic 2 "benefits of cloud computing" "comput"
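
For a keyword list this short, the suggestion from the comments (derive the families from the most frequent words) can also be sketched in base R without a topic model. This is a rough illustration under stated assumptions, not a replacement for LDA: `crude_stem` is a hypothetical helper that only strips a trailing "s" where a real stemmer or lemmatizer would be more robust, and the `ignore` list and `k = 3` are illustrative choices.

```r
keywords <- c("cloud storage", "cloud computing", "google cloud storage",
              "the cloud service", "free cloud storage",
              "what is cloud computing", "best cloud storage",
              "cloud computing definition", "amazon cloud services",
              "cloud service providers", "cloud services",
              "google cloud computing", "cloud computing services",
              "benefits of cloud computing")

# Hypothetical helper: strip a trailing "s" so "services" and "service" coincide
crude_stem <- function(w) sub("s$", "", w)

# Words to ignore: the shared main word plus a few function words (illustrative)
ignore <- c("cloud", "the", "what", "is", "of")

# Tokenize, drop ignored words, stem, and keep each word once per keyword
tokens <- lapply(strsplit(tolower(keywords), "[^a-z]+"), function(w) {
  w <- w[nzchar(w) & !w %in% ignore]
  unique(crude_stem(w))
})

# Number of keywords containing each remaining word, most frequent first
freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Keep the k most frequent words as family labels
k <- 3
fams <- names(freq)[seq_len(k)]

# Label each keyword with every family it contains
family <- vapply(tokens, function(w) {
  hit <- fams[fams %in% w]
  if (length(hit)) paste(hit, collapse = ", ") else "empty_fam"
}, character(1))

data.frame(keyword = keywords, family = family)
```

Unlike LDA, this approach is deterministic and lets a keyword carry several families, but the number of families `k` still has to be chosen by hand.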